Abstract

We focus on the supervised binary classification problem, which consists in guessing the label $Y$ associated to a co-variate $X \in \mathbb{R}^d$, given a set of $n$ independent and identically distributed co-variates and associated labels $(X_i, Y_i)$. We assume that the law of the random vector $(X, Y)$ is unknown and that the marginal law of $X$ admits a density supported on a set $\mathcal{A}$. In the particular case of plug-in classifiers, solving the classification problem boils down to the estimation of the regression function $\eta(X) = \mathbb{E}[Y|X]$. Assuming first $\mathcal{A}$ to be known, we show how it is possible to construct an estimator of $\eta$ by localized projections onto a multi-resolution analysis (MRA). In a second step, we show how this estimation procedure generalizes to the case where $\mathcal{A}$ is unknown. Interestingly, this novel estimation procedure achieves theoretical performance similar to that of the celebrated local-polynomial estimator (LPE). In addition, it benefits from the lattice structure of the underlying MRA and thus outperforms the LPE from a computational standpoint, which turns out to be a crucial feature in many practical applications. Finally, we prove that the associated plug-in classifier can reach super-fast rates under a margin assumption.

AMS 2000 subject classifications: Primary 62G05, 62G08; Secondary 62H30, 62H12.

Key-Words: Nonparametric regression; Random design; Multi-resolution analysis; Super-

1 Introduction
1.1 Setting 

The supervised binary classification problem is directly related to a wide range of applications 
such as spam detection or assisted medical diagnosis (see [25, chap. 1] for more details). It can 
be described as follows.



The supervised binary classification problem. Let $\mathcal{E}$ stand for a subset of $\mathbb{R}^d$ and write $\mathcal{Y} = \{0, 1\}$. Assume we observe $n$ co-variates $X_i \in \mathcal{E}$ and associated labels $Y_i \in \mathcal{Y}$ such that the elements of $\mathcal{D}_n = \{(X_i, Y_i), i = 1, \ldots, n\}$ are $n$ independent realizations of the random vector $(X, Y) \in \mathcal{E} \times \mathcal{Y}$ of unknown law $\mathbb{P}_{X,Y}$. Given $\mathcal{D}_n$ and a new co-variate $X_{n+1}$, we want to predict the associated label $Y_{n+1}$ so as to minimize the probability of making a mistake.

In other words, we want to build a classifier $h_n : \mathcal{E} \to \mathcal{Y}$ upon the data $\mathcal{D}_n$, which minimizes $\mathbb{P}(h_n(X) \neq Y \mid \mathcal{D}_n)$. It is well known that the Bayes classifier $h^*(\tau) := \mathbb{1}_{\{\eta(\tau) \geq 1/2\}}$, where $\eta(\tau) := \mathbb{E}[Y \mid X = \tau] = \mathbb{P}(Y = 1 \mid X = \tau)$ (unknown in practice), is optimal among all classifiers since, for any other classifier $h_n$, we have $\ell(h_n, h^*) := \mathbb{P}(h_n(X) \neq Y \mid \mathcal{D}_n) - \mathbb{P}(h^*(X) \neq Y) \geq 0$ (see [12]). As a consequence, we measure the classification risk $\mathcal{R}(h_n)$ associated to a classifier $h_n$ as its average relative performance over all data sets $\mathcal{D}_n$, $\mathcal{R}(h_n) = \mathbb{E}^{\otimes n}\ell(h_n, h^*)$. As described in [12, Chap. 7], there is no classifier $h_n$ such that $\mathcal{R}(h_n)$ goes to zero with $n$ at a specified rate for all distributions $\mathbb{P}_{X,Y}$. We therefore make the assumption that $\mathbb{P}_{X,Y}$ belongs to a class of distributions $\mathcal{P}$ (as large as possible) and aim at constructing a classifier $h_n$ such that

$$\inf_{\theta_n}\sup_{\mathbb{P}\in\mathcal{P}} \mathcal{R}(\theta_n) \lesssim \sup_{\mathbb{P}\in\mathcal{P}} \mathcal{R}(h_n) \lesssim (\log n)^{\delta}\, \inf_{\theta_n}\sup_{\mathbb{P}\in\mathcal{P}} \mathcal{R}(\theta_n), \qquad n \geq 1, \tag{1}$$

where the infimum is taken over all measurable maps $\theta_n$ from $\mathcal{E}$ into $\mathcal{Y}$ and $\lesssim$ means less than or equal up to a multiplicative constant factor independent of $n$. Any classifier $h_n$ verifying eq. (1) will be said to be (nearly) minimax optimal when $\delta = 0$ ($\delta > 0$). $\mathcal{P}$ will stand for the set of all distributions such that the marginal law of $X$ admits a density $\mu$ on $\mathcal{E}$ and $\eta$ belongs to a given smoothness class. Throughout the paper, we will denote by $\mu$ the density of $\mathbb{P}_X$.

Many classifiers have been suggested in the literature, such as $k$-nearest neighbors, neural networks, support vector machines (SVM) or decision trees (see [12, 25]). In this paper, we will exclusively focus on plug-in classifiers $h_n(\tau) := \mathbb{1}_{\{\eta_n(\tau) \geq 1/2\}}$, where $\eta_n$ stands for an estimator of $\eta$. With such classifiers, it is shown in [48] that,

$$\mathcal{R}(h_n) \leq 2\,\mathbb{E}^{\otimes n}\mathbb{E}|\eta_n(X) - \eta(X)|, \tag{2}$$

where the term on the rhs is known as the regression loss (of the estimator $\eta_n$ of $\eta$) in $L_1(\mathcal{E}, \mu)$-norm. Eq. (2) shows in particular that rates of convergence on the classification risk of a plug-in classifier $h_n$ can be readily derived from rates of convergence on the regression loss of $\eta_n$. This prompts us to focus on the regression problem, which can be stated in full generality as follows.
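To make the plug-in rule and the bound of eq. (2) concrete, here is a minimal Python sketch. Everything specific in it is an illustrative assumption rather than something taken from the paper: the toy regression function `eta(x) = x` on $[0, 1]$, the deliberately biased estimate `eta_hat`, and the uniform grid standing in for draws of $X$. The excess risk is evaluated through the classical identity $\ell(h, h^*) = \mathbb{E}\big[|2\eta(X) - 1|\,\mathbb{1}_{\{h(X) \neq h^*(X)\}}\big]$.

```python
import numpy as np

def plug_in(eta_hat):
    """Plug-in rule: predict label 1 wherever the estimate eta_hat is >= 1/2."""
    return lambda x: (eta_hat(x) >= 0.5).astype(int)

def excess_risk(h, eta, x):
    """Approximate E[|2 eta(X) - 1| 1{h(X) != h*(X)}] by averaging over the points x."""
    h_star = (eta(x) >= 0.5).astype(int)  # Bayes classifier
    return float(np.mean(np.abs(2.0 * eta(x) - 1.0) * (h(x) != h_star)))

# Toy setting (assumption): X uniform on [0, 1], eta(x) = x, estimate off by 0.1.
eta = lambda x: x
eta_hat = lambda x: x + 0.1
x_grid = np.linspace(0.0, 1.0, 10001)  # stands in for draws of X

risk = excess_risk(plug_in(eta_hat), eta, x_grid)
bound = 2.0 * float(np.mean(np.abs(eta_hat(x_grid) - eta(x_grid))))  # rhs of eq. (2)
print(risk, bound)
```

In this toy case the two classifiers disagree only on a short interval just below $1/2$, where $|2\eta - 1|$ is small, so the excess risk is far below the regression bound; this is the mechanism that margin assumptions exploit.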

The regression on a random design problem. Let $\mathcal{E}$, $\mathcal{Y}$ stand for subsets of $\mathbb{R}^d$ and $\mathbb{R}$, respectively. Assume we have at our disposal $n$ co-variates $X_i \in \mathcal{E}$ and associated observations $Y_i \in \mathcal{Y}$ such that the elements of $\mathcal{D}_n = \{(X_i, Y_i), i = 1, \ldots, n\}$ are $n$ independent realizations of the random vector $(X, Y) \in \mathcal{E} \times \mathcal{Y}$ of unknown law $\mathbb{P}_{X,Y}$. We define $\xi := Y - \eta(X)$, where $\eta(\tau) := \mathbb{E}[Y \mid X = \tau]$, so that by construction $\mathbb{E}[\xi \mid X] = 0$. Given $\mathcal{D}_n$ and under the assumption that $\mathbb{P}_{X,Y}$ belongs to a large class of distributions $\mathcal{P}$, we want to come up with an estimator $\eta_n$ of $\eta$, which is as accurate as possible for the wide range of losses $\mathcal{R}_p(\eta_n) = \mathbb{E}^{\otimes n}\mathbb{E}|\eta_n(X) - \eta(X)|^p$, $p \geq 1$.

As described previously, in the particular case where $\mathcal{Y} = \{0, 1\}$, we fall back on the regression problem associated to the classification problem with plug-in classifiers. In this case, $\xi$ is bounded such that $|\xi| \leq 1$. Notice however that the regression on a random design problem stated above permits $\mathcal{Y}$ to be any subset of $\mathbb{R}$ (including $\mathbb{R}$ itself). To be more precise, and by analogy with eq. (1), our aim is to build an estimator $\eta_n$ of $\eta$ such that, for all $p \geq 1$,

$$\inf_{\theta_n}\sup_{\mathbb{P}\in\mathcal{P}} \mathcal{R}_p(\theta_n) \lesssim \sup_{\mathbb{P}\in\mathcal{P}} \mathcal{R}_p(\eta_n) \lesssim (\log n)^{\delta}\, \inf_{\theta_n}\sup_{\mathbb{P}\in\mathcal{P}} \mathcal{R}_p(\theta_n), \qquad n \geq 1, \tag{3}$$

where the infimum is taken over all measurable maps $\theta_n$ from $\mathcal{E}$ into $\mathcal{Y}$. And $\eta_n$ will be said to be (nearly) minimax optimal when $\delta = 0$ ($\delta > 0$).

1.2 Motivations 

Many estimators of $\eta$ have been suggested in the literature to solve the regression on a random design problem. Among them, the celebrated local polynomial estimator (LPE) has been praised for its flexibility and strong theoretical performances (see [45, 46]). As is well known, the LPE is minimax optimal in any dimension $d \in \mathbb{N}$ and for any $L_p$-loss, $p \in (0, \infty]$, over the set of laws $\mathcal{P}$ such that (i) $\mu$ is bounded from above and below on its support $\mathcal{A} := \mathrm{Supp}\,\mu = \{\tau : \mu(\tau) > 0\}$, (ii) $\eta$ belongs to a Hölder ball $\mathcal{C}^s(\mathcal{E}, M)$ of radius $M$ and (iii) $\xi$ has sub-Gaussian tails. As a drawback, the LPE is computationally expensive since it requires a new regression at every single point $x \in \mathcal{A}$ where we want to estimate $\eta$. Computational efficiency is however of primary importance in many practical applications. In this paper, we show that it is possible to construct a novel estimator $\eta_n$ of $\eta$ by localized projections onto a multi-resolution analysis (MRA) of $L_2(\mathbb{R}^d, \lambda)$ (where $\lambda$ stands for the Lebesgue measure on $\mathcal{E}$), which achieves similar theoretical performance and is computationally more efficient than the LPE.

1.3 The hypotheses 

In this section, we summarize the assumptions on $\mu$, $\mathcal{A}$, $\eta$ and $\xi$ that will be used throughout the paper.

Assumption on $\mu$. Let us denote by $\mu_{\min}$, $\mu_{\max}$ two real numbers such that $0 < \mu_{\min} \leq \mu_{\max} < \infty$. As is standard in the regression on a random design setting, we assume that the density $\mu$ is bounded above and below on its support $\mathcal{A}$.

(D1) $\mu_{\min} \leq \mu(\tau) \leq \mu_{\max}$ for all $\tau \in \mathcal{A}$.

This guarantees that we have enough information at each point $x \in \mathcal{A}$ in order to estimate $\eta$ with best accuracy. For a study with weaker assumptions on $\mu$, the reader is referred to [17, 19], for example, and the references therein.

Assumption on $\mathcal{A}$. We first assume that,

(S1) $\mathcal{A} = \mathcal{E} = [0, 1]^d$.

Therefore $\mathcal{A}$ is known under (S1). We will deal with the case where $\mathcal{A}$ is unknown in Section 9.

Assumption on $\eta$. Fix $r \in \mathbb{N}$. In the sequel, we will assume that,

(H$^r_s$) The regression function $\eta$ belongs to the generalized Lipschitz ball $\mathscr{L}^s(\mathcal{E}, M)$ of radius $M$, for some $s \in (0, r)$.

Unless otherwise stated, $s$ is unknown but belongs to the interval $(0, r)$, where $r$ is known. For a detailed review of generalized Lipschitz classes, the reader is referred to the Appendix below.

Assumptions on the noise $\xi$. We will consider the two following assumptions,

(N1) Conditionally on $X$, the noise $\xi$ is uniformly bounded, meaning that there exists an absolute constant $K > 0$ such that $|\xi| \leq K$.

(N2) The noise $\xi$ is independent of $X$ and normally distributed with mean zero and variance $\sigma^2$, which we will denote by $\xi \sim \Phi(0, \sigma^2)$.

Assumption (N1) is adapted to the supervised binary classification setting, where $\mathcal{Y} = \{0, 1\}$, while (N2) is more common in the regression on a random design setting, where $\mathcal{Y} = \mathbb{R}$.

Combination of assumptions. In the sequel, we will conveniently refer by (CS1) to the set of assumptions (D1), (S1), (N1) or (N2). As detailed below in Section 3, configuration (CS1) is comparable to what is customary in the regression on a random design setting.
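As an illustration of what a sample drawn under (CS1) may look like, the following Python sketch generates data with a uniform design on $[0, 1]^d$ (so that (D1) and (S1) hold with $\mu_{\min} = \mu_{\max} = 1$) and either bounded Bernoulli noise (N1) or Gaussian noise (N2). The function name `sample_cs1`, the regression function `eta_demo` and the default value of `sigma` are hypothetical choices made for this example only.

```python
import numpy as np

def sample_cs1(n, d, eta, noise="N1", sigma=0.5, seed=0):
    """Draw (X_i, Y_i), i = 1..n, with X uniform on [0, 1]^d.

    noise="N1": Y ~ Bernoulli(eta(X)), so xi = Y - eta(X) satisfies |xi| <= 1.
    noise="N2": Y = eta(X) + xi with xi ~ N(0, sigma^2) independent of X.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    m = eta(X)
    if noise == "N1":
        Y = (rng.uniform(size=n) < m).astype(float)
    elif noise == "N2":
        Y = m + sigma * rng.standard_normal(n)
    else:
        raise ValueError("noise must be 'N1' or 'N2'")
    return X, Y

# Hypothetical smooth regression function with values in [0.1, 0.9].
eta_demo = lambda X: 0.5 + 0.4 * np.sin(2.0 * np.pi * X[:, 0])

X_cls, Y_cls = sample_cs1(500, 2, eta_demo, noise="N1")
X_reg, Y_reg = sample_cs1(500, 2, eta_demo, noise="N2")
```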

2 Our results 

Assuming at first $\mathcal{A}$ to be known, we introduce a novel nonparametric estimator $\eta^@$ of $\eta$ built upon local regressions against a multi-resolution analysis (MRA) of $L_2(\mathbb{R}^d, \lambda)$ and show that, under (CS1), it is adaptive nearly minimax optimal over a wide generalized Lipschitz scale and across the wide range of losses $L_p(\mathcal{E}, \mu)$, $p \in [1, \infty)$. We subsequently show that these results generalize to the case where $\mathcal{A}$ is unknown but belongs to a large class of (possibly disconnected) subsets of $\mathbb{R}^d$, provided we modify the estimator $\eta^@$ accordingly. We denote by $\eta^{\maltese}$ this latter estimator and prove that $\eta^{\maltese}$ can be used to build an adaptive nearly minimax optimal plug-in classifier, which can reach super-fast rates under a margin assumption. The above results essentially hinge on an exponential upper-bound on the probability of deviation of $\eta^@$ from $\eta$ at a point, as detailed in Theorem 7.1. These results either improve on the current literature or are interesting in their own right for the following reasons.

1) They show that it is possible to use MRAs to construct an adaptive nearly minimax optimal estimator $\eta^@$ of $\eta$ under the sole set of assumptions (CS1). More precisely, our results (i) hold in any dimension $d$; (ii) over the wide range of $L_p(\mathcal{E}, \mu)$-losses, $p \in [1, \infty)$; (iii) and a large Lipschitz scale; (iv) and do not require any assumption on $\mu$ beyond (D1). It is noteworthy that, contrary to most alternative MRA-based estimation methods, no smoothness assumption on $\mu$ is needed.

2) From a computational perspective, $\eta^@$ outperforms other estimators of $\eta$ under (D1) since it takes full advantage of the lattice structure of the underlying MRA. In particular it requires at most as many regressions as there are data points to be computed everywhere on $\mathcal{E}$, while alternative kernel estimators must be recomputed at each single point of $\mathcal{E}$. We illustrate this latter feature through simulation.

3) Furthermore, and contrary to alternative MRA-based estimators, the local nature of $\eta^@$ makes it possible to relax the assumption that $\mathcal{A}$ is known. This latter configuration allows $\mu$ to vanish on $\mathcal{E}$ as long as it remains bounded on its support $\mathcal{A}$, which is particularly appropriate to the supervised binary classification problem under a margin assumption.

4) In the regression on a random design setting, $\eta^@$ in fact bridges the gap between usual linear wavelet estimators and alternative kernel estimators, such as the LPE. On the one hand, $\eta^@$ inherits its computational efficiency from the lattice structure of the underlying MRA. On the other hand, it features theoretical performances similar to those of the LPE in the random design setting. In particular, it remains a (locally) linear estimator of the data (modulo a spectral thresholding of the local regression matrix), and cannot discriminate finer smoothness than the one described by (generalized) Lipschitz spaces.

Here is the paper layout. We start with a literature review in Section 3. We give a hand-waving introduction to the main ideas that underpin the local multi-resolution estimation procedure in Section 4. We define notations that will be used throughout the paper and introduce MRAs in Section 5. Our actual estimation procedure is described in Section 6 and the results are detailed in Section 7. We show how these results can be fine-tuned under additional assumptions in Section 8. Assumption (S1) is relaxed and the properties of $\eta^{\maltese}$ are detailed in Section 9. We show how these latter results carry over to the classification setting in Section 10. Results of a simulation study with $\eta^@$ under (CS1) are given in Section 11. Proofs of the regression results can be found in Section 12. The proofs of the classification results are simple modifications of the proofs given in [4] and can be found in [39]. In addition, the Appendix contains a detailed review of generalized Lipschitz spaces and MRAs.

3 Literature review 

Both the regression on a random design problem and the classification problem have a long- 
standing history in nonparametric statistics. We will therefore limit ourselves to a brief account 
of the corresponding literature that is relevant to the present paper. 

3.1 Classification with plug-in classifiers 

Let us start with a review of some of the classification literature dedicated to plug-in classifiers. The seminal work [37] showed that plug-in rules are asymptotically optimal. It has been subsequently pointed out in [36] that the classification problem is in fact only sensitive to the behavior of $\mathbb{P}_{X,Y}$ near the boundary line $\{\tau \in \mathcal{E} : \eta(\tau) = 1/2\}$, so that assumptions on the behavior of $\mathbb{P}_{X,Y}$ away from this boundary are in fact unnecessary. Subsequent works such as [3] have shown that convex combinations of plug-in classifiers can reach fast rates (meaning faster than $n^{-1/2}$, and thus faster than nonparametric estimation rates). More recently, it has been shown in [4] that plug-in classifiers can reach super-fast rates (that is, faster than $n^{-1}$) under suitable conditions. All these results are derived under some sort of smoothness assumption on the regression function $\eta$ (see [50]) and a margin assumption (MA) (see Section 10 for details). This latter assumption clarifies the behavior of $\mathbb{P}_{X,Y}$ in a neighborhood of the boundary line and kicks in naturally through the computation

$$\mathcal{R}(h_n) \leq \delta\,\mathbb{P}\big(0 < |2\eta(X) - 1| \leq \delta\big) + \mathbb{E}\,|\eta_n(X) - \eta(X)|\,\mathbb{1}_{\{|\eta_n(X) - \eta(X)| > \delta\}},$$

where $\delta$ is chosen such that it balances the two terms on the rhs. Finally, [4] exhibited optimal convergence rates under smoothness and margin assumptions and showed that they are attained with plug-in classifiers. Let us now turn to the regression on a random design problem.

3.2 Regression on a random design with wavelets 

First results on multi-resolution analysis (MRA) and wavelet bases (see [34, 38]) emerged in the nonparametric statistics literature in the early 1990's (see [27, 14, 13, 15, 16]). It has been proved that, under (CS1) and in the particular case where $\mu$ is the uniform distribution on $\mathcal{E}$, thresholded wavelet estimators of $\eta$ are nearly minimax optimal over a wide Besov scale and range of $L_p(\mathcal{E}, \mu)$-losses (see [10]). In order to leverage the power of MRAs and associated wavelet bases, several authors attempted to transpose these latter results to more general design densities $\mu$. This, however, led to a considerable amount of difficulties.

The literature relative to the study of wavelet estimators on an unknown random design breaks down into two main streams. (i) The first one aims at constructing new wavelet bases adapted to the (empirical) measure of the design (see [29, 30, 9, 47]). (ii) The second one aims at coming up with new algorithms to estimate the coefficients of the expansion of $\eta$ on traditional wavelet bases (see [2, 23, 31, 41, 44]). The present paper belongs to this second line of research.
As described in [23], the success of the LPE on a random design results from the fact that it is built as a "ratio", which cancels out most of the influence of the design. In a wavelet context, a first suggestion has therefore been to use the ratio estimator of $\eta$ (see [1, 42], for example), well known from the statistics literature on orthogonal series decomposition (see [20, 21] and [12, Chap. 17] and the references therein). Roughly speaking, the ratio estimator is the wavelet equivalent of the Nadaraya-Watson estimator (see [40, 49]). It builds on the simple observation that $\eta(x) = \eta(x)\mu(x)/\mu(x)$ for all $x \in \mathcal{A}$, where both $g(.) = \eta(.)\mu(.)$ and $\mu(.)$ are easily estimated via traditional wavelet methods. The ratio estimator thus unfortunately relies on the estimation of $\mu$ itself and must therefore assume as much smoothness on $\mu$ as on $\eta$.
To address that issue, another approach has been introduced in [6, 28]. They work with $d = 1$ and take $\mathcal{E}$ to be the unit interval $[0, 1]$. Their approach relies on the wavelet estimation of $\eta \circ G^{-1}$, where $G$ stands for the cumulative distribution of the design and $G^{-1}$ for its generalized inverse. Results are therefore stated in terms of the regularity of $\eta \circ G^{-1}$. Unfortunately, this method does not readily generalize to the multi-dimensional case, where $G$ admits no inverse.
Finally, [5] obtains adaptive near-minimax optimal wavelet estimators over a wide Besov scale under (CS1) by means of model selection techniques. His results are hence valid for the $L_2(\mathcal{E}, \mu)$-loss only.

Other relevant references that proceed with hybrid estimators (LPE and kernel estimator or LPE and wavelet estimator) are [18] and [51]. They both work under (CS1), with $d = 1$, and assume that $\mu$ is at least continuous.

4 A primer on local multi-resolution estimation under (CS1)

Figure 1: Description of the localization cells $\mathcal{H}$ and their relations to the $\mathrm{Supp}\,\varphi_{j,k}$.

In order to fix ideas, let us now give a hand-waving introduction to the local multi-resolution estimation method. Throughout the paper, we will work with $r$-MRAs of $L_2(\mathbb{R}^d, \lambda)$, for some $r \in \mathbb{N}$, consisting of nested approximation spaces $V_j \subset V_{j+1}$ built upon compactly supported scaling functions (see Section 5.2 and Appendix). Under the assumption that $\eta$ belongs to the generalized Lipschitz ball $\mathscr{L}^s(\mathcal{E}, M)$ of radius $M$, the essential supremum of the remainder of the orthogonal projection $\mathscr{P}_j\eta$ of $\eta$ onto $V_j$ decreases like $2^{-js}$ (see Appendix). The regression function $\eta$ can therefore be legitimately approximated by $\mathscr{P}_j\eta$. As an element of $V_j$, $\mathscr{P}_j\eta$ may be written as an infinite linear combination of scaling functions at level $j$. In particular, there exists a partition $\mathcal{F}_j$ of $\mathcal{E}$ into hypercubes of edge-length $2^{-j}$ such that, for all $\mathcal{H} \in \mathcal{F}_j$ and all $x \in \mathcal{H}$, we can write $\mathscr{P}_j\eta(x) = \sum_{k \in S_j(\mathcal{H})} \alpha_{j,k}\varphi_{j,k}(x)$, where $S_j(\mathcal{H})$ stands for a finite subset of $\mathbb{Z}^d$ (see Figure 1). This leaves us in turn with the estimation of the coefficients $(\alpha_{j,k})_{k \in S_j(\mathcal{H})}$ for all $\mathcal{H} \in \mathcal{F}_j$, which is achieved by least-squares and provides us with the estimator $\eta^@$ of $\eta$ on $\mathcal{H}$. It is noteworthy that the local estimator $\eta^@$ of $\eta$ is exclusively built upon scaling functions and does not require the estimation of wavelet coefficients. In particular, it does not involve any sort of wavelet coefficient thresholding. To the best of the author's knowledge, this is the first time that this local estimation procedure is proposed and studied from both a theoretical and computational perspective. In addition, we show that Lepski's method (see [32], for example) can be used to adaptively choose the resolution level $j$. Notice that Lepski's method has already been used in a MRA setting in [43]. In what follows, we detail the local multi-resolution estimation method and establish the near minimax optimality of $\eta^@$.
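The simplest instance of this construction is $r = 1$ and $d = 1$, where the Daubechies scaling function is the Haar indicator of the unit interval. Each cell then carries a single scaling function, constant on the cell, and the local least-squares fit reduces to the average of the responses falling in that cell. The Python sketch below illustrates this special case only; the function names, the convention that an empty cell is fitted by 0, and the clipping of the index at the right end point are assumptions of the example, not the paper's general procedure.

```python
import numpy as np

def haar_local_fit(X, Y, j):
    """Cell-wise least squares at level j for r = 1 (Haar), d = 1, design in [0, 1]."""
    n_cells = 2 ** j
    # Index of the dyadic cell containing each X_i (x = 1 is put in the last cell).
    cells = np.minimum(np.floor(n_cells * X).astype(int), n_cells - 1)
    counts = np.bincount(cells, minlength=n_cells)
    sums = np.bincount(cells, weights=Y, minlength=n_cells)
    # The least-squares fit of a constant on a cell is the cell average.
    return np.divide(sums, counts, out=np.zeros(n_cells), where=counts > 0)

def haar_predict(means, x, j):
    """Evaluate the piecewise-constant fit at the point(s) x."""
    n_cells = 2 ** j
    idx = np.minimum(np.floor(n_cells * np.asarray(x)).astype(int), n_cells - 1)
    return means[idx]

# Toy data (assumption): a noiseless step function observed on a regular design.
X_demo = (np.arange(1000) + 0.5) / 1000
Y_demo = np.where(X_demo < 0.5, 1.0, 3.0)
means_demo = haar_local_fit(X_demo, Y_demo, j=3)
```

One pass over the data yields the fit on all $2^j$ cells at once, which is the computational advantage of working on a lattice.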

5 Notations 

5.1 Preliminary notations 

In the sequel, we will denote by $B_p(z, \rho)$ the closed $\ell_p$-ball of $\mathbb{R}^d$ of center $z$ and radius $\rho$. More generally, we adopt the following notations: for any subset $S$ of a topological space $\mathcal{S}$, $\mathrm{Closure}(S)$ will stand for its closure and $S^c$ for its complement in $\mathcal{S}$. For any subset $S$ of $\mathbb{R}^d$, $z \in \mathbb{R}^d$ and $\tau \in \mathbb{R}^+$, we will write $z + S$ and $\tau S$ to mean the sets $\{z + u : u \in S\}$ and $\{\tau u : u \in S\}$, respectively. Finally, given a set (of functions) $\mathcal{R}$, $\mathrm{Span}\,\mathcal{R}$ will denote the set of finite linear combinations of elements of $\mathcal{R}$.

For any $p \in \mathbb{N}$, vectors $v$ of $\mathbb{R}^p$ will be seen as elements of $\mathcal{M}_{p,1}$, that is, matrices with $p$ rows and one column. For any two $u, v \in \mathbb{R}^p$, $\langle u, v\rangle$ will denote their Euclidean scalar product. In addition, for any $p, q \in \mathbb{N}$ and $M \in \mathcal{M}_{p,q}$, $M^t$ will stand for the transpose of $M$. For any two matrices $M$, $P$, $M \cdot P$ will denote their matrix product when it makes sense. $[M]_{k,\ell}$ and $[M]_{k,\bullet}$ will respectively stand for the element of $M$ located at line $k$, column $\ell$ and the $k$-th row of $M$. Finally, $\|M\|_S$ will denote the spectral norm of $M$ (see [26, §5.6.6]).

We denote by $\lfloor z \rfloor$ the integer part of $z \in \mathbb{R}$ defined as $\max\{a \in \mathbb{Z} : a \leq z\}$. More generally, given $z \in \mathbb{R}^d$, we write $\lfloor z \rfloor$ the integer part of $z$, meant in a coordinate-wise sense. In the same way, we denote by $\lceil z \rceil$ the smallest integer greater than $z$ (in a coordinate-wise sense). We write rhs (resp. lhs) to mean right- (resp. left-) hand-side and sometimes write $:=$ to mean equal by definition. Throughout the paper, we will refer to constants independent of $n$ as absolute constants and $c$, $C$ will stand for absolute constants whose value may vary from line to line. For any two sequences $a_n$, $b_n$ of $n$, we will write $a_n \lesssim b_n$ to mean $a_n \leq C b_n$ for some absolute constant $C$ and $a_n \sim b_n$ to mean that there exist two constants $c$, $C$ independent of $n$ such that

$$c\,b_n \leq a_n \leq C\,b_n.$$

5.2 The polynomial reproduction property 

In what follows, we will exclusively consider MRAs built upon Daubechies' scaling functions $\varphi_{j,k}$ (see Appendix and [8, 35, 7, 24]). Given a natural integer $r$, we will refer by $r$-MRA to a MRA whose nested approximation spaces $V_j$ reproduce polynomials up to order $r - 1$. Daubechies' scaling functions $\varphi_{j,k}$ are appealing in the estimation framework since they are compactly supported and have minimal volume supports among scaling functions that give rise to $r$-MRAs. Recall finally that an $r$-MRA can explain Lipschitz smoothness $s$ for any $s \in (0, r)$.

5.3 General notations



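Consider the Daubechies' $r$-MRA of $L_2(\mathbb{R}^d, \lambda)$ built upon Daubechies' scaling function $\varphi$, as described in the Appendix. We will denote by $\mathrm{Supp}\,\varphi_{j,k} = \{\tau \in \mathbb{R}^d : \varphi_{j,k}(\tau) \neq 0\}$ the support of $\varphi_{j,k}$. Recall that $\mathrm{Supp}\,\varphi = [-(r-1), r]^d$. To alleviate notations, we will write $\varphi_k$ in place of $\varphi_{0,k}$ and $\varphi_j$ in place of $\varphi_{j,0}$. Notice that $\mathrm{Closure}(\mathrm{Supp}\,\varphi_{j,k})$ is in fact a closed hyper-cube of $\mathbb{R}^d$ whose corners lie on the lattice $2^{-j}\mathbb{Z}^d$. For any $x \in \mathcal{E}$, we write

$$S_j(x) = \{\nu \in \mathbb{Z}^d : x \in \mathrm{Supp}\,\varphi_{j,\nu}\}.$$

Furthermore, we write $\mathcal{F}_j := 2^{-j}\big((0, 1)^d + \mathbb{Z}^d\big) \cap \mathcal{E}$. It defines a partition of $\mathcal{E}$ into $2^{jd}$ hypercubes of edge length $2^{-j}$, modulo a $\lambda$-null set. For the sake of concision, we write $R = 2r - 1$ in the sequel. We have the following proposition, whose proof is straightforward and thus left to the reader.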

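Proposition 5.1. $S_j$ verifies the following properties,

1. $S_j$ is constant on each element $\mathcal{H} \in \mathcal{F}_j$. We will denote by $S_j(\mathcal{H})$ its value on $\mathcal{H}$.

2. Moreover, for any two $\mathcal{H}_1, \mathcal{H}_2 \in \mathcal{F}_j$, $\mathcal{H}_1 \neq \mathcal{H}_2$, $S_j(\mathcal{H}_1)$ differs from $S_j(\mathcal{H}_2)$ by at least one element.

3. Finally, for any $\mathcal{H} \in \mathcal{F}_j$, $\# S_j(\mathcal{H}) = R^d$.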

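It is a direct consequence of Proposition 5.1 that in the case where $r = 1$, we have $\# S_j(\mathcal{H}) = 1$ for all $\mathcal{H} \in \mathcal{F}_j$. We denote its single element by $z_j(\mathcal{H})$. It is in fact easy to show that $z_j(\mathcal{H}) = \lfloor 2^j x \rfloor$ for any $x \in \mathcal{H}$. For any $\mathcal{H} \in \mathcal{F}_j$, we write

$$\varphi_{\mathcal{H}}(x) := \mathbb{1}_{\mathcal{H}}(x)\,\big(\varphi_{j,\nu}(x)\big)_{\nu \in S_j(\mathcal{H})},$$

and denote by $Y_{\mathcal{H}} = \big(Y_i \mathbb{1}_{\mathcal{H}}(X_i)\big)_{1 \leq i \leq n}$.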
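The index set $S_j(x)$ can be enumerated directly from the support of the scaling functions. The Python sketch below assumes, as recalled above, that $\varphi_{j,\nu}$ is supported on $2^{-j}(\nu + [-(r-1), r]^d)$, and works with the open interior of that cube, so that a point which is not on the dyadic lattice gets exactly $R^d = (2r-1)^d$ active indices. The helper name `active_indices` is hypothetical.

```python
import itertools
import math

def active_indices(x, j, r):
    """Indices nu in Z^d such that x lies in the open support of phi_{j,nu}.

    Coordinate-wise, x belongs to 2^{-j} (nu - (r - 1), nu + r) if and only if
    2^j x - r < nu < 2^j x + r - 1.
    """
    per_axis = []
    for coord in x:
        t = (2 ** j) * coord
        lo = math.floor(t - r) + 1     # smallest integer strictly above t - r
        hi = math.ceil(t + r - 1) - 1  # largest integer strictly below t + r - 1
        per_axis.append(range(lo, hi + 1))
    return list(itertools.product(*per_axis))

# r = 1 (Haar): a single index, equal to the coordinate-wise floor of 2^j x.
print(active_indices((0.3, 0.7), j=2, r=1))
# r = 2, d = 2: (2r - 1)^d = 9 indices.
print(len(active_indices((0.3, 0.7), j=2, r=2)))
```

Points lying exactly on the lattice $2^{-j}\mathbb{Z}^d$ receive fewer indices with this open-support convention; they form the $\lambda$-null set that the partition $\mathcal{F}_j$ leaves out.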

6 Construction of the local estimator $\eta^@$

Assume we are under (CS1) and work with the Daubechies' $r$-MRA of $L_2(\mathbb{R}^d, \lambda)$. The estimation procedure is local, so that we start by selecting a point $x \in \mathcal{A}$. By construction, there exists $\mathcal{H} \in \mathcal{F}_j$ such that $x \in \mathcal{H}$. We want to estimate $\eta$ at point $x$. As detailed in the Appendix, an estimator of $\eta$ can be reduced to an estimator of the orthogonal projection $\mathscr{P}_j\eta$ of $\eta$ onto $V_j$, modulo an error $\mathscr{R}_j\eta$ such that $|\mathscr{R}_j\eta| \leq M 2^{-js}$ when $\eta$ belongs to the generalized Lipschitz ball $\mathscr{L}^s(\mathcal{E}, M)$ of radius $M$. Now, we can write

$$\mathscr{P}_j\eta(x) = \sum_{\nu \in S_j(\mathcal{H})} \alpha_{j,\nu}\,\varphi_{j,\nu}(x).$$

This leaves us with exactly $R^d$ coefficients $\alpha_{j,\nu}$, $\nu \in S_j(\mathcal{H})$, to estimate, which are valid for any $x \in \mathcal{H}$. Recall that $S_j(x) = \{\nu \in \mathbb{Z}^d : x \in \mathrm{Supp}\,\varphi_{j,\nu}\}$. We evaluate these coefficients by least-squares. Denote by $B_{\mathcal{H}} \in \mathcal{M}_{n,R^d}$ the matrix whose rows are the vectors $\varphi_{\mathcal{H}}(X_i)^t$ for $1 \leq i \leq n$. Let us denote by $k_1, \ldots, k_{R^d}$ the elements of $S_j(\mathcal{H})$. Then we choose

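$$a_{\mathcal{H}} \in \arg\min_{a \in \mathbb{R}^{R^d}} \frac{1}{n}\sum_{i=1}^{n}\Big(Y_i - \sum_{t=1}^{R^d} a_t\,\varphi_{j,k_t}(X_i)\Big)^2 \mathbb{1}_{\mathcal{H}}(X_i) = \arg\min_{a \in \mathbb{R}^{R^d}} \|Y_{\mathcal{H}} - B_{\mathcal{H}} \cdot a\|^2_{\ell_2(\mathbb{R}^n)}, \tag{4}$$

where we set $a_{\mathcal{H}} = 0$ if the arg min above contains more than one element. Let us write $Q_{\mathcal{H}} = B_{\mathcal{H}}^t \cdot B_{\mathcal{H}}/n \in \mathcal{M}_{R^d,R^d}$. As is well known, when $Q_{\mathcal{H}}$ is invertible, the arg min on the rhs of eq. (4) admits one single element which writes as follows,

$$a_{\mathcal{H}} = Q_{\mathcal{H}}^{-1} \cdot \frac{B_{\mathcal{H}}^t}{n} \cdot Y_{\mathcal{H}}. \tag{5}$$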

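Naturally, we will denote the corresponding estimator of $\mathscr{P}_j\eta$ at point $x$ by $\eta_{\mathcal{H}}(x) = \langle a_{\mathcal{H}}, \varphi_{\mathcal{H}}(x)\rangle$. We now introduce a thresholded version of $\eta_{\mathcal{H}}$ based on the spectral thresholding of $Q_{\mathcal{H}}$. We denote by $\lambda_{\min}(Q_{\mathcal{H}})$ the smallest eigenvalue of $Q_{\mathcal{H}}$ in the case where $r \geq 2$, when $Q_{\mathcal{H}}$ is actually a matrix, and $Q_{\mathcal{H}}$ itself in the case where $r = 1$, when it is a real number. Furthermore, we define

$$\eta^@_{\mathcal{H}}(x) = \begin{cases} 0 & \text{if } \lambda_{\min}(Q_{\mathcal{H}}) \leq \pi_n^{-1}, \\ \eta_{\mathcal{H}}(x) & \text{otherwise,} \end{cases} \tag{6}$$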

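where $\pi_n$ is a tuning parameter. In practice, and unless otherwise stated, we choose $\pi_n = \log n$. Moreover, we assume throughout the paper that $n$ is large enough so that $\pi_n^{-1} \leq \min(g_{\min}, 1)$, where, for reasons that will be clarified later, $g_{\min}$ denotes the constant of eq. (7), which depends on $\mu_{\min}$ and on the strictly positive constant $c_{\min}$ defined in the proof of Proposition 12.4. Ultimately, the estimator $\eta^@_j$ of $\mathscr{P}_j\eta$ is defined as,

$$\eta^@_j(x) = \sum_{\mathcal{H} \in \mathcal{F}_j} \eta^@_{\mathcal{H}}(x)\,\mathbb{1}_{\mathcal{H}}(x), \qquad x \in \mathcal{E}. \tag{8}$$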
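The local fit with spectral thresholding can be sketched in a few lines of Python. The sketch takes the local design matrix as given (rows equal to the local scaling functions evaluated at the $X_i$, with zero rows for points outside the cell) and is agnostic about how it was built. The function names are hypothetical, and the exact thresholding convention (the fit is set to 0 when $\lambda_{\min}(Q_{\mathcal{H}}) \leq \pi_n^{-1}$) is an assumption of this example.

```python
import numpy as np

def thresholded_local_fit(B, Y, pi_n):
    """Least-squares coefficients on one cell, discarded when Q is ill-conditioned.

    B : (n, p) local design matrix; Y : (n,) responses.
    Returns the coefficient vector, or None when lambda_min(Q) <= 1 / pi_n.
    """
    n = B.shape[0]
    Q = B.T @ B / n
    lam_min = np.linalg.eigvalsh(Q)[0]  # eigenvalues are returned in ascending order
    if lam_min <= 1.0 / pi_n:
        return None
    return np.linalg.solve(Q, B.T @ Y / n)

def local_estimate(coeffs, phi_x):
    """Value <a, phi(x)> of the local fit, or 0 when the cell was thresholded."""
    return 0.0 if coeffs is None else float(coeffs @ phi_x)

# Toy check (assumption): Haar case r = 1, d = 1, level j = 3, cell [0.25, 0.375).
n, j = 1000, 3
X = (np.arange(n) + 0.5) / n
in_cell = np.floor(2 ** j * X) == 2
B = (2 ** (j / 2) * in_cell)[:, None]   # phi_{j,2}(X_i) = 2^{j/2} on the cell, 0 elsewhere
Y = np.full(n, 3.0)
coeffs = thresholded_local_fit(B, Y, pi_n=np.log(n))
estimate = local_estimate(coeffs, np.array([2 ** (j / 2)]))
empty = local_estimate(thresholded_local_fit(np.zeros((n, 1)), Y, np.log(n)), np.array([1.0]))
```

With a uniform design, $Q_{\mathcal{H}}$ is close to the identity for cells that hold enough points, so the threshold only switches the fit off on cells that are nearly empty.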

7 The results 

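Let $r$ be a natural integer, denote by $\mathcal{P}$ the set of all distributions on $\mathcal{E} \times \mathcal{Y}$ and write

$$\mathcal{P}(\mathrm{CS1}, \mathrm{H}^r_s) := \{\mathbb{P} \in \mathcal{P} : (\mathrm{CS1}) \text{ and } (\mathrm{H}^r_s) \text{ hold true}\}. \tag{9}$$

Furthermore, we define $j_r$, $j_s$, $J$ and $t(n)$ such that,

$$2^{j_r} = \lfloor n^{\frac{1}{2r+d}} \rfloor, \qquad 2^{j_s} = \lfloor n^{\frac{1}{2s+d}} \rfloor, \qquad 2^{Jd} = \lfloor n\,t(n)^{-2} \rfloor, \qquad t(n)^2 = \kappa\,\pi_n^2 \log n,$$

where $\kappa$ is a positive real number to be chosen later. In addition, we write $\mathcal{J}_n = \{j_r, j_r + 1, \ldots, J - 1, J\}$. Notice that $j_s$ strikes the balance between bias and variance in the sense that, for $\log n \geq (2s + d)\log 2$ and $s \in (0, r)$, one has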



2-jss < 2''+^2^'^in-i 



(10a) 
(10b) 
(10c) 



Throughout the sequel, we assume that $n$ is large enough so that the latter inequalities hold true. Our first result gives an upper bound on the probability of deviation of $\eta^@_j$ from $\eta$ at a point $x \in \mathcal{A}$.

Theorem 7.1. Fix $r \in \mathbb{N}$ and assume we are under (CS1) and (H$^r_s$). Recall that $\eta^@_j$ is defined in eq. (8). Then, for all $j \in \mathcal{J}_n$, all $\delta > 2M2^{-js}\max(1, 3\pi_n R^{2d}\mu_{\max})$ and all $x \in \mathcal{A}$, we have



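$$\sup_{\mathbb{P} \in \mathcal{P}(\mathrm{CS1}, \mathrm{H}^r_s)} \mathbb{P}^{\otimes n}\big(|\eta(x) - \eta^@_j(x)| \geq \delta\big) \leq [\ldots]\,\mathbb{1}_{\{\delta \leq M\}} + [\ldots]\,\Lambda\Big(\frac{\delta\,2^{-jd/2}}{2\pi_n R^d}\Big), \tag{11}$$

where $\Lambda$ is defined as follows,

$$\Lambda(\delta) = 2\exp\Big(-\frac{n\delta^2}{18K^2\mu_{\max} + 4K2^{jd/2}\delta}\Big) \qquad \text{under (N1)},$$

$$\Lambda(\delta) = 1 \wedge \frac{2\sigma\big(\mu_{\max} + 2^{jd/2}\delta\big)^{1/2}}{\delta\sqrt{2\pi n}}\exp\Big(-\frac{n\delta^2}{2\sigma^2\big(\mu_{\max} + 2^{jd/2}\delta\big)}\Big) + 2\exp\Big(-\frac{n\delta^2}{2\mu_{\max} + [\ldots]\,2^{jd/2}\delta}\Big) \qquad \text{under (N2)}.$$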



As a consequence of the above theorem, we can deduce the (near) minimax optimality of $\eta^@_j$ over generalized Lipschitz balls.

Corollary 7.1. Fix $r \in \mathbb{N}$ and assume we are under (CS1) and (H$^r_s$). Then, for any $p \in [1, \infty)$ and $j \in \mathcal{J}_n$, one has

$$\sup_{\mathbb{P} \in \mathcal{P}(\mathrm{CS1}, \mathrm{H}^r_s)} \mathbb{E}^{\otimes n}\|\eta - \eta^@_j\|^p_{L_p(\mathcal{E}, \mu)} \leq C(p)\,\pi_n^p \max\Big(2^{-js}, \frac{2^{jd/2}}{\sqrt{n}}\Big)^p, \tag{12}$$

where $\eta^@_j$ and $C(p)$ are defined in eq. (8) and Proposition 12.1 below, respectively. A fortiori, when $s$ is known, we can choose $j = j_s$ and apply eq. (10a) and eq. (10c) above to obtain

$$\sup_{\mathbb{P} \in \mathcal{P}(\mathrm{CS1}, \mathrm{H}^r_s)} \mathbb{E}^{\otimes n}\|\eta - \eta^@_{j_s}\|^p_{L_p(\mathcal{E}, \mu)} \leq C(p)\,\pi_n^p\, n^{-\frac{sp}{2s+d}}.$$

This, together with the lower-bound of Theorem 7.3, proves that $\eta^@_{j_s}$ is (nearly) minimax optimal over the generalized Lipschitz ball $\mathscr{L}^s(\mathcal{E}, M)$ of radius $M$.

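The next Theorem shows that the approximation level $j$ can be determined from the data so that we obtain adaptation over a wide generalized Lipschitz scale.

Theorem 7.2. Fix $r \in \mathbb{N}$ and assume we are under (CS1) and (H$^r_s$). We define

$$g(j, k) := \frac{2^{jd/2}}{\sqrt{n}}\,t(n) + \frac{2^{kd/2}}{\sqrt{n}}\,t(n),$$

$$\hat{j}(x) := \inf\{j \in \mathcal{J}_n : |\eta^@_j(x) - \eta^@_k(x)| \leq g(j, k),\ \forall k \in \mathcal{J}_n,\ k > j\}, \qquad x \in \mathcal{A},$$

where $\eta^@_j$ is defined in eq. (8) and $\inf \emptyset = \max(\mathcal{J}_n) = J$. If $\kappa$ is chosen large enough, with a lower bound expressed in terms of the constant defined in Proposition 12.2, then we obtain

$$\sup_{\mathbb{P} \in \mathcal{P}(\mathrm{CS1}, \mathrm{H}^r_s)} \mathbb{E}^{\otimes n}\|\eta - \eta^@_{\hat{j}(.)}(.)\|^p_{L_p(\mathcal{E}, \mu)} \lesssim \pi_n^p\, t(n)^p\, n^{-\frac{sp}{2s+d}}.$$

So that $\eta^@_{\hat{j}(.)}(.)$ is a nearly minimax adaptive estimator of $\eta$ over the generalized Lipschitz scale $\bigcup_{0 < s < r} \mathscr{L}^s(\mathcal{E}, M)$.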
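The selection rule of Theorem 7.2 is a Lepski-type comparison of estimators across resolution levels and can be sketched as follows. The Python function below operates at a single point $x$ and takes the values $\eta^@_j(x)$ as given. Its name, its signature and the explicit form used for the threshold, $g(j, k) = t(n)\,(2^{jd/2} + 2^{kd/2})/\sqrt{n}$, are assumptions of this example.

```python
import math

def lepski_level(estimates, n, d, t_n):
    """Smallest level j whose estimate stays within g(j, k) of every finer level k > j.

    estimates : dict mapping each level j to the value of the level-j estimator at x.
    """
    if not estimates:
        raise ValueError("at least one resolution level is required")

    def g(j, k):
        return t_n * (2 ** (j * d / 2) + 2 ** (k * d / 2)) / math.sqrt(n)

    levels = sorted(estimates)
    for j in levels:
        finer = [k for k in levels if k > j]
        if all(abs(estimates[j] - estimates[k]) <= g(j, k) for k in finer):
            return j  # the finest level always qualifies, matching inf(empty set) = J

# Toy values (assumption): level 1 is badly biased, levels 2 to 4 agree.
chosen = lepski_level({1: 0.0, 2: 1.0, 3: 1.01, 4: 1.0}, n=10_000, d=1, t_n=1.0)
print(chosen)
```

The rule keeps the coarsest level that is statistically indistinguishable from all finer ones, which is how the unknown smoothness $s$ is adapted to without being estimated.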

Finally, we prove that $\eta^@$ is indeed (nearly) minimax optimal by giving the corresponding lower-bound result.

Theorem 7.3. Assume we are under (CS1) and (H$^r_s$). We write $\inf_{\theta_n}$ the infimum over all estimators $\theta_n$ of $\eta$, that is, all measurable functions of the data $\mathcal{D}_n$. Then, for $d \geq 1$, $s > 0$, we have, for all $1 \leq p < \infty$,

$$\inf_{\theta_n}\sup_{\mathbb{P} \in \mathcal{P}(\mathrm{CS1}, \mathrm{H}^r_s)} \mathbb{E}^{\otimes n}\|\theta_n - \eta\|^p_{L_p(\mathcal{E}, \mu)} \gtrsim n^{-\frac{sp}{2s+d}}.$$

The next section shows how these results can be improved in the case where we benefit from additional information on $\mu$ or $\eta$.



8 Refinement of the results 

As can be seen from Corollary 7.1 and Theorem 7.2 above, $\pi_n$ appears as a multiplicative factor in the upper-bounds and thus deteriorates them by a multiplicative $\log n$ term. However, this need not be the case, and under appropriate additional assumptions, $\pi_n$ can be chosen to be a constant. Consider indeed the following two assumptions.

(O1) We know $\mu^-_{\min} \in \mathbb{R}$ such that $0 < \mu^-_{\min} \leq \mu_{\min}$.

(O2) We know a finite positive real number $M$ such that $\|\eta\|_{L_\infty(\mathcal{E}, \lambda)} \leq M$.

Under (O1), we know a lower bound $\mu^-_{\min}$ of $\mu_{\min}$, and therefore a lower bound $g^-_{\min}$ of $g_{\min}$ (see eq. (7)). Under (O1), we will thus choose $\pi_n^{-1} = \min(g^-_{\min}, 1)$. It is straightforward to show that Theorem 7.1 is still valid with this new value of $\pi_n$ (see Remark 12.1 in the proof of Theorem 7.1), and thus all the subsequent results follow as well. Under (O2), we know an upper bound $M$ of the essential supremum of $\eta$ on $\mathcal{E}$. In that case, we redefine

$$\eta^@_{\mathcal{H}}(x) = T_M\big(\eta_{\mathcal{H}}(x)\big)\,\mathbb{1}_{\{\lambda_{\min}(Q_{\mathcal{H}}) > 0\}}, \tag{13}$$

where, for any $z \in \mathbb{R}$, we have written $T_M(z) = z\,\mathbb{1}_{\{|z| \leq M\}} + M\,\mathrm{sign}(z)\,\mathbb{1}_{\{|z| > M\}}$. Once again, it is straightforward to show that Theorem 7.1 is now valid with $\pi_n^{-1} = \min(g_{\min}, 1)$ and $2M$ in place of $M$ in the indicator function on the rhs of eq. (11) (see Remark 12.1 in the proof of Theorem 7.1), and thus all the subsequent results follow as well.
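The truncation operator $T_M$ is a clipping to $[-M, M]$, so the redefined estimator of eq. (13) is straightforward to sketch in Python. The function names below are hypothetical, and the local design matrix `B` is taken as given, with the same convention as in the least-squares step of Section 6.

```python
import numpy as np

def truncate(z, M):
    """T_M(z) = z 1{|z| <= M} + M sign(z) 1{|z| > M}, i.e. clipping to [-M, M]."""
    return np.clip(z, -M, M)

def truncated_local_estimate(B, Y, phi_x, M):
    """Local fit of eq. (13): truncated least squares, 0 when Q is singular."""
    n = B.shape[0]
    Q = B.T @ B / n
    if np.linalg.eigvalsh(Q)[0] <= 0.0:   # lambda_min(Q) > 0 fails
        return 0.0
    a = np.linalg.solve(Q, B.T @ Y / n)
    return float(truncate(a @ phi_x, M))

clipped = truncate(np.array([-3.0, 0.5, 2.0]), 1.0)
# Toy cell (assumption): one constant regressor, responses equal to 5, known bound M = 2.
value = truncated_local_estimate(np.ones((4, 1)), np.full(4, 5.0), np.array([1.0]), M=2.0)
```

Because the output is bounded by $M$ whatever the conditioning of $Q_{\mathcal{H}}$, no data-dependent threshold $\pi_n^{-1}$ is needed in the estimator itself under (O2).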

Notice that $\pi_n$ is an absolute constant under (O1) and (O2), while it is an increasing sequence of $n$ to be fine-tuned by the statistician otherwise. Hence $\pi_n$ appears to be the price to pay for not knowing a lower bound of $\mu_{\min}$ or an upper bound of the essential supremum of $\eta$ on $\mathcal{E}$.

9 Relaxation of assumption (SI) 

9.1 The problem 

Now, we would like to relax assumption (SI) and allow for A to be an unknown subset of 
S, eventually disconnected. Under (CSl), the success of r^® stems from the fact that it is 
constructed upon an approximation grid of the form 2~-'Z'^ fl [0, 1]*^, whose edges coincide 
exactly with the boundary of A. In the case where A is unknown, some cells of the lattice 
might straddle the boundary of A and thus require a new treatment. 

In order to handle this new configuration, we will need to make a smoothness assumption on
the boundary of A and allow the estimation cells to move with the point at which we want
to estimate $\eta$. Ultimately, we devise a new estimator $\eta^{\maltese}$ of $\eta$ which is built upon a moving
approximation grid. In fact, this new estimation method ensures that the point $x$ at which we
want to estimate $\eta$ always belongs to a cell $\mathcal{H}$ of $\mathcal{F}_j$ at resolution level $j$, whose center belongs
to A. This will ensure that local regressions performed on cells that straddle the boundary of
A are still meaningful.

The smoothness assumption we will make on A might be compared to the support assumption
made in [4, eq. (2.1)] in the classification context. In substance, it is assumed in [4] that A
is locally ball-shaped to be compatible with the ball-shaped support of the LPE kernel, which
they use to estimate $\eta$. In our case, we perform estimation with multi-dimensional scaling
functions whose supports are cube-shaped and will thus assume that A is locally cube-shaped.

9.2 Smoothness assumption on A 

Let us now make these informal arguments more precise. To that end we introduce assumption
(S2) as an alternative to (S1) above. Fix an absolute constant $m_0 \in (0, 1)$ and recall that
$2^{j_s} = \lfloor n^{1/(2s+d)} \rfloor$. With these notations, (S2) goes as follows.

(S2) $\mathcal{E} = \mathbb{R}^d$ and A belongs to $\mathscr{A}_{j_s}$, where

$$\mathscr{A}_{j} := \{A \subset \mathbb{R}^d : \exists m > m_0,\ \forall x \in A,\ \exists z_x \in \mathbb{R}^d,\ 0 \in B_\infty(z_x, m) \subset 2^{j}(A - x)\}.$$

Figure 2: (S2) allows for A to be non-convex and possibly disconnected.

In words, (S2) means that if we zoom in close enough on any $x \in A$, we can find a hypercube
$B_\infty(z_x, m)$ that contains $x$ and is a subset of A. Notice that for all $j_1 > j_2$, the component
of $2^{j_2}(A - x)$ that contains 0 is a subset of the component of $2^{j_1}(A - x)$ that contains 0, so
that $\mathscr{A}_{j_2} \subset \mathscr{A}_{j_1}$. Therefore $\mathscr{A}_{j_s}$ grows with $n$ and shrinks with $s$. Of course, (S1) is a particular
case of (S2). Setting (S2) allows A to be unknown and to belong to a wide class of subsets of $\mathbb{R}^d$,
possibly disconnected (see Figure 2).

In the sequel, we will conveniently refer to the set of assumptions (D1), (S2), and (N1)
or (N2) as (CS2).

9.3 Moving local estimation under (CS2) 

As detailed above, $\eta^{\maltese}$ is obtained by local regression on a moving approximation grid. Let us
describe the construction of $\eta^{\maltese}$ more precisely.

First of all, we split the sample into two pieces. For simplicity, let us assume that we have
$2n$ data points at our disposal. The first half of the sample points, which we denote by
$\mathcal{D}'_n = \{(X'_i, Y'_i),\ i = 1, \dots, n\}$, will be used to identify the support A of $\mu$, while the second half,
which we denote by $\mathcal{D}_n = \{(X_i, Y_i),\ i = 1, \dots, n\}$, will be used to estimate the scaling function
coefficients by local regressions.

Let us denote by $\mathcal{H}_0$ the cell $2^{-j}[0,1]^d$ of the lattice $2^{-j}\mathbb{Z}^d$ at resolution $j$, and by $\mathcal{H}_0(x)$
the same cell centered at $x$, that is $\mathcal{H}_0(x) = x - 2^{-j-1} + 2^{-j}[0,1]^d$. Then, the construction of
$\eta^{\maltese}_j(x)$ at a point $x \in \mathbb{R}^d$ goes as follows. (i) If none of the design points $(X'_i)$ of the sample
$\mathcal{D}'_n$ lie in $\mathcal{H}_0(x)$, then take $\eta^{\maltese}_j(x) = 0$. (ii) If one or more design points of the sample $\mathcal{D}'_n$ lie in
$\mathcal{H}_0(x)$, we select one of them and denote it by $X'_{i_x}$ (the selection procedure is of no importance
beyond computational considerations). By construction, $x$ belongs to the cell $\mathcal{H}_0(X'_{i_x})$ centered
at $X'_{i_x} \in A$. Since $X'_{i_x}$ belongs to A, it makes sense to perform a local regression on $\mathcal{H}_0(X'_{i_x})$
with the sample points of $\mathcal{D}_n$, which gives rise to an estimator $\eta^{\maltese}_j$ of $\eta$ valid at any point of
$\mathcal{H}_0(X'_{i_x}) \cap A$. It is noteworthy that this procedure uses the sample $\mathcal{D}'_n$ to identify the support
A of $\mu$.
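
The selection step (i)-(ii) can be sketched as follows. This is only an illustrative Python sketch: the function name `select_anchor` and the rule "take the first design point found" are our assumptions, since the text leaves the selection procedure free.

```python
import numpy as np

def select_anchor(x, design_points, j):
    """Return the index of a design point lying in the cell of side 2**-j
    centered at x, or None if that cell is empty (the estimate at x is then 0)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    pts = np.asarray(design_points, dtype=float).reshape(len(design_points), -1)
    half_side = 2.0 ** (-j - 1)
    # A point lies in the cell centered at x iff its sup-norm distance to x
    # is at most half the side length.
    inside = np.all(np.abs(pts - x) <= half_side, axis=1)
    hits = np.flatnonzero(inside)
    return int(hits[0]) if hits.size else None

# Illustrative design points in dimension d = 2.
pts = np.array([[0.10, 0.10], [0.52, 0.48], [0.90, 0.20]])
found = select_anchor([0.5, 0.5], pts, j=3)
missing = select_anchor([0.3, 0.7], pts, j=3)
```

Because the sup-norm distance is symmetric, a design point found in the cell centered at $x$ is also the center of a cell that contains $x$, which is what the construction relies on.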

Interestingly, the above estimation procedure requires at most as many regressions as there
are data points in $\mathcal{D}'_n$ to return an estimator $\eta^{\maltese}_j$ of $\eta$ at every single point $x \in A$. It is
therefore computationally more efficient than any other kernel estimator, such as the LPE. The
computational performance of $\eta^{\maltese}_j$ can in fact be further improved in the sense that the local
regression on the cell $\mathcal{H}_0(X'_i)$ can be omitted if the cell $\mathcal{H}_0(X'_i)$ is itself included in the union of
cells centered at other design points of $\mathcal{D}'_n$. In particular, we can choose $X'_{i_x}$ to be a design point
$X'_i$ of $\mathcal{D}'_n$ that belongs to $\mathcal{H}_0(x)$ and for which a local regression has already been performed, if
it exists, or any one of the $X'_i$ that belong to $\mathcal{H}_0(x)$ otherwise.

Intuitively, the computational efficiency of $\eta^{\maltese}_j$ stems from the fact that the design points $(X'_i)$
provide some valuable information on the unknown support A of $\mu$, which can be exploited
under (CS2). In particular, and as we will see below, (D1) guarantees that the design points
of $\mathcal{D}'_n$ populate A densely enough so that, as long as $j \le J$, the cells $\mathcal{H}_0(X'_i)$, $1 \le i \le n$, form
a cover of A, modulo a set whose $\mu$-measure decreases almost exponentially fast toward zero
with $n$.



9.4 Construction of the local estimator $\eta^{\maltese}$

Assume we are under (S2) and work with the Daubechies' $r$-MRA of $L_2(\mathbb{R}^d, \lambda)$. Obviously,
shifting the approximation grid is equivalent to shifting the data points $(X_i)$ of $\mathcal{D}_n$ and keeping
the lattice fixed. For ease of notation and clarity, we adopt this second point of view. In
order to compute $\eta^{\maltese}_j$ at a point $x \in \mathcal{H}_0(X'_{i_x}) \cap A$, we want to shift the design points in such
a way that $X'_{i_x}$ falls right in the middle of $\mathcal{H}_0$. In other words, we want $X'_{i_x}$ to be shifted to the
point $2^{-j-1} \in \mathbb{R}^d$ (all of whose coordinates equal $2^{-j-1} \in \mathbb{R}$). This corresponds to the change of
variable $\tilde X_i = X_i - (X'_{i_x} - 2^{-j-1})$, where $X_i$ and $\tilde X_i$ denote the representations of
the same data point in the canonical and shifted coordinate systems of $\mathbb{R}^d$, respectively. In order
to compute $\eta^{\maltese}_j$ at the point $x \in \mathcal{H}_0(X'_{i_x}) \cap A$, it is therefore enough to perform a local regression
on $\mathcal{H}_0$ against the shifted data points,

$$\tilde{\mathcal{D}}_n = \{(\tilde X_i, Y_i),\ i = 1, \dots, n\}.$$
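
The change of variable described above is a plain translation. The Python sketch below illustrates it under our own naming (`shift_design`, `in_reference_cell`); it only shows that the selected design point lands at the center of the reference cell $2^{-j}[0,1]^d$ and is not the paper's implementation.

```python
import numpy as np

def shift_design(points, anchor, j):
    """Apply u -> u - (anchor - 2**(-j-1)) to every row of `points`."""
    points = np.asarray(points, dtype=float)
    anchor = np.asarray(anchor, dtype=float)
    return points - (anchor - 2.0 ** (-j - 1))

def in_reference_cell(shifted, j):
    """Boolean mask of the shifted points that fall in 2**-j * [0, 1]^d."""
    side = 2.0 ** (-j)
    return np.all((shifted >= 0.0) & (shifted <= side), axis=1)

# Illustrative values in dimension d = 2 at resolution j = 2.
j = 2
anchor = np.array([0.40, 0.70])
X = np.array([[0.40, 0.70], [0.45, 0.60], [0.90, 0.10]])
X_shifted = shift_design(X, anchor, j)
mask = in_reference_cell(X_shifted, j)
```

Only the shifted points flagged by `mask` enter the local regression on the fixed reference cell.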

For the sake of concision, we will denote by $\tilde u = u - (X'_{i_x} - 2^{-j-1})$ the coordinate representation
of a point $u$ in the shifted coordinate system of $\mathbb{R}^d$. Let us denote by $k_1, \dots, k_{R^d}$ the elements
of $S_j(\mathcal{H}_0)$. With these notations, eq. (4) must be corrected and written as

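$$a_{\mathcal{H}_0} \in \arg\min_{a \in \mathbb{R}^{R^d}} \sum_{i=1}^{n} \Big( Y_i - \sum_{t=1}^{R^d} a_t\, \varphi_{j,k_t}(\tilde X_i) \Big)^2 \mathbb{1}_{\mathcal{H}_0}(\tilde X_i), \qquad (14)$$

where we set $a_{\mathcal{H}_0} = 0$ if the argmin above contains more than one element. The notations
introduced in Section 5.3 can be updated to this new setting as follows. $B_{\mathcal{H}_0}$ now stands for
the random matrix of $\mathcal{M}_{n,R^d}$ whose rows are the $\varphi_j(\tilde X_i)^{T}\,\mathbb{1}_{\mathcal{H}_0}(\tilde X_i)$, $i = 1, \dots, n$. In addition, we recall
that we have defined $Q_{\mathcal{H}_0} = B_{\mathcal{H}_0}^{T} \cdot B_{\mathcal{H}_0}/n \in \mathcal{M}_{R^d,R^d}$. Its coefficients thus write as

$$[Q_{\mathcal{H}_0}]_{k,k'} = \frac{1}{n} \sum_{i=1}^{n} \varphi_{j,k}(\tilde X_i)\, \varphi_{j,k'}(\tilde X_i)\, \mathbb{1}_{\mathcal{H}_0}(\tilde X_i).$$

Notice here that $S_j(\mathcal{H}_0) = \{\nu \in \mathbb{Z}^d : 2^{-1} \in \mathrm{Supp}\,\varphi_\nu\}$, which depends neither on $j$ nor on $x$.
Therefore, and for later reference, we denote

$$\mathcal{S} := \{\nu \in \mathbb{Z}^d : 2^{-1} \in \mathrm{Supp}\,\varphi_\nu\}. \qquad (15)$$

In addition, if we write $Y_{\mathcal{H}_0} = (Y_i \mathbb{1}_{\mathcal{H}_0}(\tilde X_i))_{1 \le i \le n}$, then eq. (5) still holds true when the solution
to eq. (14) is unique, so that, for all $x \in \mathcal{H}_0(X'_{i_x}) \cap A$, we can write $\eta^{\maltese}_{\mathcal{H}_0}(x) = \langle a_{\mathcal{H}_0}, \varphi_j(\tilde x) \rangle$.
Finally, eq. (6) remains valid with $X_i$ replaced by $\tilde X_i$ and $\mathcal{H}$ by $\mathcal{H}_0$, $\eta^{@}_{\mathcal{H}}$ redefined as $\eta^{\maltese}_{\mathcal{H}_0}$ and
$g_{\min}$ redefined as

$$g_{\min} = \mu_{\min}\, c_{\min}, \qquad (16)$$

where $c_{\min}$ is the strictly positive constant defined in Lemma 12.1 below. So that ultimately,
the estimator $\eta^{\maltese}_j$ of $\mathscr{P}_j \eta$ at a point $x \in \mathbb{R}^d$ writes as

$$\eta^{\maltese}_j(x) = \eta^{\maltese}_{\mathcal{H}_0}(x), \qquad x \in \mathcal{E}. \qquad (17)$$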

Notice that by contrast with eq. (8) above, the sum over the hypercubes of $\mathcal{F}_j$ has disappeared.
This is due to the fact that the approximation grid moves with $x$ so that we end up virtually
always performing estimation on the same hypercube $\mathcal{H}_0$.



9.5 The results 

Interestingly, $\eta^{\maltese}$ still satisfies results similar to the ones described in Section 7. To be more
precise, recall that we work with a sample of size $2n$ broken up into two pieces $\mathcal{D}_n$ and $\mathcal{D}'_n$ of
size $n$. Let us redefine $\mathcal{J}_n$ so that $\mathcal{J}_n = \{j_s, j_s + 1, \dots, J - 1, J\}$ where $2^{j_s} = \lfloor n^{1/(2s+d)} \rfloor$. Then, we
obtain the following result in place of Theorem 7.1.

Theorem 9.1. Fix $r \in \mathbb{N}$ and assume we are under (CS2) and $(\mathrm{H}^r_s)$. Recall that $\eta^{\maltese}_j$ is
defined in eq. (17). Then, for all $j \in \mathcal{J}_n$, all $\delta > 2M2^{-js} \max(1, 3\pi_n R^d \mu_{\max})$ and all $x \in A$, we
have

sup F^''{\r]{x) -7]f{x)\>6) 

PeP(CS2,H|) 

< 3i?2^exp (-n2-^'^ — -] tss<M} 



+ R''A 



52-^f 



27r„i?'^ 

where $\Lambda$ has been defined in Theorem 7.1.
Leaving aside the fact that $\eta^{\maltese}_j$ is constructed upon a sample of size $2n$, the sole difference with the
result of Theorem 7.1 is that the leading constant in front of the exponential on the second line
has changed from $2R^{2d}$ to $3R^{2d}$. Furthermore, it is straightforward to deduce from Theorem 9.1
results similar to Corollary 7.1, Theorem 7.2 and Theorem 7.3, and a fortiori the refined results
obtained in Section 8, for $\eta^{\maltese}$ under (CS2). The proofs of these results for $\eta^{\maltese}$ under the set of
assumptions (CS2) follow, for the most part, exactly the same lines as the proofs given for $\eta^{@}$
under (CS1). Details can be found in Section 12.2.

10 Classification via local multi-resolution projections 

Recall from [4] that the margin assumption can be written as follows.
(MA) There exist constants $C_* > 0$ and $\vartheta \ge 0$ such that

$$\mathbb{P}(0 < |2\eta(X) - 1| \le t) \le C_*\, t^{\vartheta}, \qquad \forall t > 0.$$

The binary classification setting corresponds to (CS2), under assumptions (N1) and (O2).
Notice besides that we have $K = 1$ in (N1) and $M = 1$ in $(\mathrm{H}^r_s)$. Since we are under (O2),
it follows from Section 8 that $\pi_n^{-1} = \pi_0^{-1} = \min(1, g_{\min}/2)$ is independent of $n$ and $\eta^{\maltese}$ is capped at
$M = 1$ as in eq. (13). For the sake of coherence, we denote by $j^{\maltese}$ the adaptive resolution level
built upon $\eta^{\maltese}$, as described in Theorem 7.2, and define $\mathcal{P}(\mathrm{CS2}, \mathrm{H}^r_s)$ by analogy with eq. (9)
above. Finally, we recall that $\eta^{\maltese}$ is built upon a sample of size $2n$ split into two sub-samples
$\mathcal{D}_n$ and $\mathcal{D}'_n$ of size $n$.

As a consequence of Theorem 9.1, we can use the plug-in classifier built upon $\eta^{\maltese}$ to obtain
results similar to the ones given in [4, Lemma 3.1] for LPE-based plug-in classifiers.

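Corollary 10.1. Fix $r \in \mathbb{N}$ and assume we are in the binary classification setting. Assume
moreover that $(\mathrm{H}^r_s)$ and (MA) hold true. Consider the plug-in classifiers $h^{\maltese}_{j_s}(\cdot) = \mathbb{1}_{\{2\eta^{\maltese}_{j_s}(\cdot) \ge 1\}}$
and $h^{\maltese}_{j^{\maltese}}(\cdot) = \mathbb{1}_{\{2\eta^{\maltese}_{j^{\maltese}}(\cdot) \ge 1\}}$. Then, as soon as $\kappa > C_0(1 + \vartheta)$, we have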

sup ^{hfj < Cin-^^^'+^\ (18) 

Pe'P(CS2,H^,MA) 

sup ^{h%) < C2(logn)^n-5^(^+''\ (19) 

PeP(CS2,H|,MA) 

where the classification risk $\mathscr{E}(\cdot)$ has been defined in Section 1 and the constants $C_0, C_1, C_2$ are
made explicit in [39] and only depend on $\mu_{\max}, \mu_{\min}, r, d$ and $\vartheta$.

In fact, it can be shown that the classifiers $h^{\maltese}$ defined in Corollary 10.1 are (nearly) minimax
optimal. Proofs of Corollary 10.1 and the associated lower-bound can be found in [39].
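
For concreteness, a plug-in classifier thresholds a regression estimate at one half. The Python sketch below shows this rule for an arbitrary estimate; `plug_in_classifier` and `toy_eta_hat` are hypothetical names introduced for illustration, and the toy estimate is not the estimator studied in this paper.

```python
import numpy as np

def plug_in_classifier(eta_hat):
    """Turn a regression estimate into the classifier x -> 1{2 * eta_hat(x) >= 1}."""
    def classify(x):
        return (2.0 * np.asarray(eta_hat(x), dtype=float) >= 1.0).astype(int)
    return classify

# Hypothetical regression estimate, used only to exercise the classifier.
toy_eta_hat = lambda x: np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
labels = plug_in_classifier(toy_eta_hat)(np.array([0.1, 0.5, 0.8]))
```

Any estimate of $\eta$ can be passed in place of `toy_eta_hat`; the quality of the resulting classifier is then driven by how well that estimate behaves near the level $1/2$, which is what the margin assumption quantifies.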

11 Simulation study 

In order to illustrate the performance of $\eta^{@}$, we have carried out a simulation study in the
regression setting in the one-dimensional case, that is with $d = 1$. As detailed earlier, the sole
purpose of this simulation is (1) to show that $\eta^{@}$ can be easily implemented and is computationally
efficient, (2) to show that $\eta^{@}$ works well in practice in the case where the density of the design
$\mu$ is discontinuous, and (3) to give an intuitive visual feel for $\eta^{@}$, which is built upon the
juxtaposition of local regressions against a set of scaling functions. In particular, we run our
simulation against benchmark signals, which allows us to compare our results with the ones detailed
in the literature for alternative kernel estimators (see the simulation study in [32], for example).
We have run them under (CS1), which corresponds to the case where $\eta^{@}_j$ can be completely
computed with exactly $2^j$ regressions. We have in particular $\mathcal{E} = [0, 1] = A$. We focus on the
functions $\eta$ introduced in [14] and used as a benchmark in numerous subsequent simulation
studies. They are made available through the Wavelab850 library, freely available at
http://www-stat.stanford.edu/~wavelab/. In addition we have chosen the noise $\xi$ to be standard
normal, that is we are working under (N2) with $\sigma = 1$. In all cases, we have chosen the
signal-to-noise ratio (SNR) to be equal to 7. To be more specific, we are working on a dyadic
grid G of [0, 1] of resolution 2~^^. We compute the root-mean-squared-error (RMSE) of both 
the signal and the noise on that grid and rescale the signal so that its RMSE is seven times
that of the noise.
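
The rescaling step can be sketched as follows in Python. We read the "RMSE" of the signal and of the noise as their root mean square over the grid, which is an assumption on our part; the grid size and the placeholder signal below are illustrative and are not the Wavelab benchmarks.

```python
import numpy as np

def rms(values):
    """Root mean square of a sampled sequence."""
    values = np.asarray(values, dtype=float)
    return float(np.sqrt(np.mean(values ** 2)))

def rescale_to_snr(signal, noise, snr=7.0):
    """Rescale `signal` so that rms(signal) equals snr * rms(noise)."""
    signal = np.asarray(signal, dtype=float)
    return signal * (snr * rms(noise) / rms(signal))

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 1024)        # grid size chosen for illustration only
signal = np.sin(4 * np.pi * grid)         # placeholder signal, not a Wavelab benchmark
noise = rng.standard_normal(grid.size)    # standard normal noise, sigma = 1
scaled = rescale_to_snr(signal, noise)
```

After rescaling, the ratio `rms(scaled) / rms(noise)` equals the target SNR up to floating-point error.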

Let us now give details about the simulation of the sample points and the computation of
the estimator. We divide the unit interval into ten sub-segments $A_k := 10^{-1}[k, k + 1]$ for
$k = 0, \dots, 9$. We define the density of $X$ as follows.

$$\mu(x) = \sum_{k=0}^{9} 10\, p_k\, \mathbb{1}_{A_k}(x).$$

We choose the $p_k$'s at random. To that end, we denote by $(u_k)_{0 \le k \le 9}$ ten realizations of the
uniform random variable on $[0.25, 1]$, write $v = u_0 + \dots + u_9$ and set $p_k = u_k v^{-1}$. Notice that
this guarantees that $\mu \ge \min_{0 \le k \le 9} 10 p_k \ge \mu_{\min} = 0.25$ on $[0, 1]$. We then simulate 3000 sample
points $X_i$ according to $\mu$. Finally, we bring the points back onto the grid $G$ by assimilating
them to their nearest grid node. Since the $X_i$'s are supposed to be drawn from a law that is
absolutely continuous with respect to the Lebesgue measure on $[0, 1]$, we must keep only one
data point per grid node. This reduces the number of data points from 3000 to the number
that is reported on top of each of the histograms.
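
The design simulation described above can be sketched in Python as follows. The function names and the random seed are ours, and `grid_exponent` is a free parameter of the sketch: the value 12 used below is an arbitrary illustration, not the resolution used in the study.

```python
import numpy as np

def random_design_weights(rng):
    """Draw u_0..u_9 uniformly on [0.25, 1] and normalise them into p_0..p_9."""
    u = rng.uniform(0.25, 1.0, size=10)
    return u / u.sum()

def sample_design(rng, p, n_points, grid_exponent):
    """Sample from the piecewise-constant density, snap each point to the nearest
    node of the dyadic grid of step 2**-grid_exponent, keep one point per node."""
    segment = rng.choice(10, size=n_points, p=p)      # pick A_k with probability p_k
    x = (segment + rng.uniform(0.0, 1.0, size=n_points)) / 10.0
    step = 2.0 ** (-grid_exponent)
    snapped = np.round(x / step) * step               # nearest grid node
    return np.unique(snapped)                         # sorted, duplicates removed

rng = np.random.default_rng(1)
p = random_design_weights(rng)
design = sample_design(rng, p, n_points=3000, grid_exponent=12)
```

Since each $u_k$ is at least 0.25 and their sum is at most 10, the density values $10 p_k$ are bounded below by 0.25, as stated in the text.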

In order to compute the adaptive estimator at sample points $X_i$, we use the boundary-corrected
scaling functions coded into Wavelab850 for $r = 3$ and for which we must have $j \ge 3$. We set
$J = [\log(n/\log n)/\log 2]$. The elimination of redundant sample points on the grid removes
on average 150 points so that we obtain $J = 10$. We therefore have $\mathcal{J}_n = \{3, 4, \dots, 10\}$.
Notice interestingly that the computation of $\eta^{@}_3$ requires only 8 regressions and $\eta^{@}_{10}$ requires
1,024 of them. This is much smaller than for the LPE, whose computation necessitates as
many regressions as there are sample points at each resolution level. In practice, we compute
the minimum eigenvalues of all regression matrices across partitions and resolution levels and
choose $\pi_n^{-1}$ to be the first decile of this set of values. When proving theoretical results, we
have chosen $\eta^{@}$ to be zero on the small probability event where the minimum eigenvalue of the
regression matrix is smaller than $\pi_n^{-1}$. In practice we can choose it to be an average value of
the nearby cells in order to get an estimator that is overall more appealing to the eye. In our
simulation, we in fact do not use that modification. Instead, we modify $j^{@}$ to be the highest
$j \in \{3, \dots, j^{@}\}$ such that $\eta^{@}_j$ has been computed from a valid regression matrix, meaning a
regression matrix whose smallest eigenvalue is greater than the threshold $\pi_n^{-1}$.
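
The two practical rules above (first-decile threshold and fallback to the highest valid level) can be sketched as follows in Python. The function names and the dictionary layout `min_eig_by_level` are assumptions made for the illustration; the numbers are made up.

```python
import numpy as np

def eigenvalue_threshold(min_eigenvalues):
    """First decile of the smallest eigenvalues collected over all cells and levels."""
    return float(np.quantile(np.asarray(min_eigenvalues, dtype=float), 0.1))

def highest_valid_level(adaptive_level, min_eig_by_level, threshold, lowest_level=3):
    """Largest j in {lowest_level, ..., adaptive_level} whose regression matrix has
    smallest eigenvalue above `threshold`; None if no level qualifies."""
    for j in range(adaptive_level, lowest_level - 1, -1):
        if min_eig_by_level[j] > threshold:
            return j
    return None

# Made-up eigenvalues for one cell chain, levels 3 to 6.
min_eig_by_level = {3: 0.9, 4: 0.7, 5: 0.05, 6: 0.01}
level = highest_valid_level(6, min_eig_by_level, threshold=0.1)
threshold = eigenvalue_threshold(np.arange(1, 11) / 10.0)
```

Here `np.quantile` uses its default linear interpolation; other conventions for the decile would give slightly different thresholds.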
In practice, for a given signal, we generate $\mu$ at random and compute $\eta^{@}_{j^{@}}$ for 100 samples drawn
from $\mu$. We quantify the performance of $\eta^{@}_{j^{@}}$ by its relative RMSE, meaning its RMSE computed
at sample points $X_i$ divided by the amplitude of the true signal, that is its maximal absolute
value on the underlying dyadic grid. We display results for "Doppler", "HeaviSine", "Bumps"
and "Blocks" corresponding to the median performance among the 100 trials. Each figure
displays four graphs. Clockwise from the top left corner, they display in turn: a histogram of the
sample points $X_i$; the adaptive level $j^{@}$ at sample points $X_i$; the true signal (black dots) and
the estimator $\eta^{@}_{j^{@}}$ at sample points $X_i$ (solid blue line), with its corresponding relative RMSE in
the title; and finally the original signal (solid blue line) with its noisy version at sample points
$X_i$ (red dots).
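
The relative RMSE used to score the estimator can be written down directly. The Python sketch below follows the verbal definition given above; the function name and the toy numbers are ours.

```python
import numpy as np

def relative_rmse(estimate, truth_at_samples, truth_on_grid):
    """RMSE at the sample points divided by max |signal| on the dyadic grid."""
    estimate = np.asarray(estimate, dtype=float)
    truth_at_samples = np.asarray(truth_at_samples, dtype=float)
    rmse = np.sqrt(np.mean((estimate - truth_at_samples) ** 2))
    amplitude = np.max(np.abs(np.asarray(truth_on_grid, dtype=float)))
    return float(rmse / amplitude)

# Toy numbers: the signal has amplitude 4 and the estimate is off by 0.5 everywhere.
truth_on_grid = np.array([0.0, 2.0, -4.0, 1.0])
truth_at_samples = np.array([2.0, -4.0])
estimate = np.array([2.5, -4.5])
score = relative_rmse(estimate, truth_at_samples, truth_on_grid)
```

Dividing by the amplitude makes the score comparable across benchmark signals of different scales.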

12 Proofs 

12.1 Proof of the upper-bound results under (CSl) 
12.1.1 Proof of Corollary 7.1 

Consider the term 

1= [ E[\r]{x) - rj®{x)\P]fi{x)dx. 

Now, apply Proposition 12.1 and notice that J^fi{x)dx = 1 to show that / is upper-bounded 
by the term that appears on the rhs of eq. (12) stated in Corollary 7.1. In particular, for all 
1 < p < cxD, we obtain / < C {p)7i^t{n)~P < C{p) < 00. This in turn proves that we can apply 
the Fubini-Tonelli theorem to get 

and concludes the proof. □ 
12.1.2 Proof of Theorem 7.1

Let $x \in A$ and $j \in \mathcal{J}_n$. There exists $\mathcal{H} \in \mathcal{F}_j$ such that $x \in \mathcal{H}$. Let us work on the set
$\{\lambda_{\min}(Q_{\mathcal{H}}) \ge \pi_n^{-1}\}$, on which $Q_{\mathcal{H}}$ is invertible. On that set, we can write

= \(Qn-[^-iBn-an-Yn)yVn{x))\ 



n 



\-i\\Bh 



Now, notice that for all Xj G "H, we have Fj = (a^,y9^(Xj)) + ^jTj^Xi) + ^j. Write ^n 
{Mj7]{Xi)l'u{Xi))i<i<n and = (^il^(Xi))i<i<„. Then, we have. 



Wn = \^- (Bn ■ an - Yn)\ = |^ ■ {^n + ^n)\ G K^"' 
n n 



Thus, a direct application of Proposition 12.5 allows to write, for 6 > 2M2 ^'^ max(l, SunR'^fJ'ma.x), 



62-^1 

<n\\Wnl,(^^R^)> — 



< R" sup P [Wn]k > 



27r„i?2 
52-4 



62~^ 



2'KnR'^ 

By definition, we have fjfix) = r]y^{x), so that we have 

P(|r/(x) - vf{x)\ >6)= FMx) - r^®(x)| > 6, KUQu) > vr"^) 



(20) 



By construction, //^(x) = r(^{x) on the event {Amm(<5w) > ^n^} ^'^'^ ^S(^) = on its 
complement. So that we obtain \rj{x) — iin{x)\ = \ri{x)\ < M on the rhs of eq. (20). No- 
tice in addition that M2~^'^ > \Mjri{x)\ under (Hg) (see Appendix). Finally, we obtain, for 
I > M2-^' > \Mj7]{x)l 

P(|r/(x) -r/f(x)| > 5) 

< P(|^,r7(x) -r/|^(x)| > -,A^in(gw) > tt-^) + P(A,,i„(g^) < 7r;^)l|M>5}, 

where we have written M = M. The term on the Ihs has been dealt with above. The term on 
the rhs is tackled using Proposition 12.3. This concludes the proof. 
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Remark 12.1. Under (02), we have |r/^(a;)| < M, and since rj G Jff^{S,M), we obtain 
\ri{x) — ri'^{x)\ < 2M on the rhs of eq. (20). While on the Ihs, it is straightforward that (see 
, Chap. 10]) 



- r]f^{x)\ = \r]{x) -TM{Vni^))\ < Ivi^) - r]ni^)\- 
Under (Ol), the proof remains unchanged. So that the proof still holds with 



M 



2M, under (02) 
M, otherwise. 



□ 



12.1.3 Proof of Theorem 7.2 



This result is obtained after a slight modification of [32, Proposition 3.4]. In the same way
as in the proof of Theorem 7.1, we are brought back to controlling $\mathbb{E}|\eta^{@}_{j^{@}(x)}(x) - \eta(x)|^p$ for all
$x \in A$. To that end, we split this term as follows

= / + //. 
Let us first deal with /. Notice that 

The last term is of the good order since 

/ 2^4^'' 
E\rif(x) - ri(xW < C(p)<max 2-^^', 

\ \/n 

[Klogn)2 \ / 

according to Proposition 12.1, eq. (10a) and eq. (10c). Regarding the first term, notice that on 
the event {j®{x) < js}, one has got 

\v%ix)i^) - < 9{j®ix),js) < sup g{k,js) 

2^4 

< g{js,Js) = 'it{n)^ < 2t{n)2'n-^^, 
'n 



where we have used eq. (10a) and eq. (10c) and which is of the good order too. Let us now 
turn to II. For any two j < k, we write 

g{x,j, k) = {\r]f{x) - Vki^)\ > aU, k)}. 
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Write J^n{j) = {k G J'n '■ k > j}. Notice first that we have the following inclusions 

{fix)=j}c U g{x,j-i,k), 

keJn{j-i) 

{f{x)>Js}= U {f{x)=j}C U U g{x,j-l,k). 
Therefore, we can write 

< E E E|r^f(x)-r^(x)r%.,,_i,,). 

Now, we notice that 

\vfix)-vfix)\ < \rif{x)-ri{x)\ + \ri{x) - rif{x)\. 

So that 

Qix,j,k) = {\vfix)-rif{x)\ > gij,k)} 

C i^lvfix) - v{x)\ > \jl^\vf{x) - v{x)\ > 

/ 24 \ / 2^i 

¥{g{x,j, A;)) < P I \r]f{x) - r]{x)\ > —t{n) | + P ( \r]f{x) - 7]{x)\ > —t{n) 



So that a direct application of the Cauchy-Schwarz inequality leads to 

E\r)f{x) - < (E|f (x) - r/(x)p^)ip(^(x, j - 1, 

Now, a direct application of Proposition 12.1 for jg < j < J gets us 



(E\T]fix)-r]{x)\^n'' < VcWjnPmax { 2-^^ < y^C{2pj{K\og 



n) 2 . 



n 



Besides, notice that for js ^ j < k < J, we can apply Proposition 12.2 with n > ^Cg ^ to 
obtain 

P (^vf{x) - Vix)\ > j V P (^vf{x) - r/(x)| > j < 5R^'^n-^2. 

To conclude the proof, it remains to notice that # J7n < log n and remark that the multiplicative 
constant in the upper-bound of Theorem 7.2 is indeed smaller than, say, 5 for n large enough. □ 






12.1.4 A few useful Propositions and Lemmas 

Proposition 12.1. Fix r G N and assume we are under (CSl) and (Hg). Then, For any 
X & A and i & Jn, one has got 

( Ti^" 
E[|r7(x) - r/f(x)|^'] < C(p)<max 2-^^ 



where 

C{p) = max(l, i^^V^ax)^ + C,{r, d, p, /i^^x; K, a) + 2Mm^\ 
and C5 is made explicit in the proof at eq. (21). 

Proof. For any $x \in A$, take $\delta = 3M2^{-js} \max(1, 3\pi_n R^d \mu_{\max})$. Notice first that $\max(1, 3\pi_n R^d \mu_{\max}) \le
\pi_n \max(1, 3R^d \mu_{\max})$ since, by construction, $\pi_n^{-1} \le 1$ in any case. Now, write

E[\r]{x) - vf{x)\P] = [ ptP-^F{\r]{x) - r]f{x)\ > t)dt 
Jr+ 

00 

— r. . 



< F + / ptP-^F{\r]{x) - r]f{x)\ > t)dt. 

5 



As 5 has been fixed, we only need to tackle the rhs above, which we will denote by //. Using 
Theorem 7.1, we can write 

( -2 \ rM 

-n2-^'^ 7^^-^ — / pt^'-^dt 

Denote by Hi and II2 the Ihs and rhs terms above, respectively. Now, recall that j < J, where 
2Jd ^ nt{n)~'^ and t(n)^ = Kn^logn. Therefore, as soon as 



« > - I 2/imaxit +-R vr„ , , 



we have J/i < 2MPR^'^n-l. Let us now turn to 1/2. Assume first that we are working under 
the bounded noise assumption, (Nl). In that case, we have 

, 2^1^' 
K) n„ 



n 
where the last inequality results from the change of variable $u = \sqrt{n}\, 2^{-jd/2} \pi_n^{-1} t$ together with
the fact that $2^{jd} \le n$, and we have written

-.^ 2fl^f pt-' exp (- e,^.^..^l^s/fM ) 

Assume now that we are working under the Gaussian noise assumption (N2). In that case, we 
have 



tn-^2-^^V27m 



+ 2R'' / pt^-' exp ( W ^ ] dt. 

Denote by 11^ and 11^ the first and second term, respectively. They can both be handled in 
the exact same way as 7/2, which leads to 



/n 



where we have written 

poo ^ f2 \ 

dt, 



SR'^^liu.s^ + ^RH 



and 



, 24 V 

Ih < Cslr, d,p, 11^^, a)\7in—;=\ , 



n 



where we have written 

2(7i?i(4i?Vma. + 2t)^ 



C-i-.^R'^J^ ptP-^llA 



tV2TT 



To conclude, let us write 



I C2{r, d,p, fij^i^y^, K) under (Nl) 

63(r, d,p,Aimax, cr) + 64(r, d,p,Aimax) Under (N2) 
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Therefore, we ultimately obtain 

E[\r]{x) - < (3Wmax(l, 3i?Vmax)^ + C5 + IM^^") 



Til max I 2-^^ 

In 



which concludes the proof. □ 



Proposition 12.2. Fix $r \in \mathbb{N}$ and assume we are under (CS1) and $(\mathrm{H}^r_s)$. This means in
particular that $s \in (0, r)$. Let $j$ be such that $j_s \le j \le J$. Let $t(n)^2 = \kappa \pi_n^2 \log n$, and define



Cei^r, d, yUmax, K, 7r„), under (Nl) 
Cfi{r, d, /imax, 71"^), undcr (N2) 



where Cq is defined in eq. (22) below. Then we have, forn large enough, 

P I \rjf{x) -r]{x)\ > ^^(^) 1 < 5i?2'^n-'^^^ 



'max I 



Proof. The proof relies on a direct application of Theorem 7.1. Write Cq = 2M max(l, 3iTnR jJ' 
and notice indeed that the theorem applies since for j > jg, we get 2^2n~2 > 2^*^''+2)2~-^''* (see 
eq. (10b)) and, as soon as n is large enough, we have t{n) > T^^^Cq. This leads us to 



tin) 



27r„i?'^v^; ■ 



Let us denote the first term by / and the second one by //. / is easily tackled noticing that 
for i < J, n2~^'^ > n2~^'^ > t{nY = Kvr^logn. So that, we obtain / < 2E?'^n~'^'^^ , where we 
have written 

„ , , \ min(l,JC~^) , , 

C6(r, d, /i„,ax, 7r„) := — D2d , spd -1 • (22) 

Let us now turn to //. Assume first we work under (Nl). Then we can write 

t(n)27r-2 



II < 2^" exp 
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Notice first that 2^2t(n) < ^Jn. Therefore, we obtain // < IR'^n ^'"'^ . Assume now that we 
work under (N2). In that case, we obtain 



^ t(n)7r-iv^ 



exp 



4i?2V„.ax + 2/?%-l^ 



2^-2 



+ 2i?°'exp ' ^ ^ " 



We proceed exactly as under (Nl). So that we obtain // < C-jn~'^^^, where 

minfl, cr^^) 



C8(r, d, 



r-l' 

nax I " 



t(n)7r-V27r 

So that C7 < Si?*^ for n large enough. Notice finally that Cs{r, d, /imax? ^? 7r„) > C6(r, d, /imax, ^5 7r„). 
This concludes the proof. □ 

Proposition 12.3. Fix an integer r > 1 and assume we are under (CSl). Let x E A 
and j G J'n- By construction, there exists Ti G J^j such that x E Ti. Recall besides that 
i^Sjin) = R'^, where R = 2r-1 is obviously independent of both X and j . Write \\.\\ = ||.|L 
and assume there exists a strictly positive constant (7min independent of x and j such that 

(23) 

ueR^ ■.\\u\\=i 



Then, for any real number t such that < t < we have 
nXmUQu) <t)< 2i?2'^exp (-n2-^'' 



Proof. Under the assumption described in eq. (23), we get 

Amin((5w) > min {u,EQnu) + min {u, {Qn - 'EQn)u) 

ueR^"^ :\\u\\=l ueR'^'' :\\u\\=l 



>2t- Yl \lQnly - l^Qnlyl 
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Write T, = y^j^,{Xi)y^jy{Xi)ln{X,) - E(^,-,(X)<^,,,,(X)1^(X), so that ET, = 0, VarT, < 
/^max2-'°' and |Tj| < 2^'^^^. A direct application of Bernstein inequality for any 6 > leads to 

n]uy - [^Qn]iyy\ > ^) 
1 " 



n . 

1=1 



< 2 exp 
To conclude, we write 



□ 



Proposition 12.4. Fzx an integer r > 1 anc? assume we are under (CSl). For any x E A 

and j G J'n, we denote by Ti the unique hypercube of Fj such that x E Ti. Then, there exists 
a strictly positive absolute constant g^ij^ independent of both x and j such that, for all j G J7n 
and all x E A, we have Xmini^Qn) > fl'min > 0. 

Proof. For any u G M.^'' such that ||'u||^_^^jgjjd-| = 1, we can write 

\ ^ 



/in 



{ Yl ''■^'PjA^) dw, (24) 
/ Yu^(f^{w) dw, (25) 



where (3 has been defined in eq. (15) and the last equality results from the fact that the value 
of the integral on the rhs of eq. (24) is invariant with "H. Let us denote by ~^ the unit-sphere 
of . As detailed in [39], the map 



u G §^'-1 ^ 



/ V'Mi,v5i.(w) dw, 






is absolutely continuous with respect to u on the compact subset S'^'*^^ of M^''. It therefore 
reaches its minimum at some point u* E . It is a direct consequence of the local linear 
independence property of the scaling functions {ipk) (see Proposition 12.7) that 



dw = Cmin > 0, 



where Cmm is a constant that is both independent from x and j. This concludes the proof with 

9m'm A'rninCmin- I— I 

Proposition 12.5. Let „ and ('Ci)j=i,...,n be sequences of independent random vari- 

ables such that E(^|X) = 0. Take any j > jV- Moreover, assume we are given a function 
such that \\^j{-)\\hac{£,X) ^ M2~^''^ , a subset "H of S and a scaling function ^j,k- Write 



1 " 

n . 

1=1 



and define 



2 exp 



n6^ 



under (Nl) 



j 2(T(/imax + 2^25)2 

I — JV^ — "''P 



n5 a 



2^-2 



/imax + 2-' 2 5 



+2 exp 



under (N2) 



2/imax + 

Then, for all 5 > 3n^s,xM2~^^'^~^^\ we have 

n\WjA >S)< A{6). 

Proof. Notice indeed that 

1 " 

W,,k<\-J2^JAX^)^^MX^)\ 
^ i=l 

1 

+ |-5^<^,,fc(X,)^,(X,)l^(X,) -E<^,-fc(X)^,(X)l^(X)| 



i=l 



+ \Eip,4X)^j{X)ln{X)\ 
= 1 + 11 + III. 
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So that we can write 

mWj,k\ >S)< P(/ > 6/3) + P(// > 5/3) + P(/// > 6/3). 
Now it is enough to notice that 

III < j \ipj^k{w)Mj{w)\l-u{w)^i{w)dw 

max 

/ \Lpj^k{w)Mj{w)\dw 

< /^max||V'j,fc||Li(£-,A)||=^j||Loo(f,A) 

So that > 6/3) = as soon as 5 > 3/imaxM2-J(^+l). 

Now, turn to // and write // = l^^Ti/^l with Tj = ipj^k{Xi)Mj{Xi)lfi{Xi) 
- Eipj^k{X)Mj{X)l'H{X). Obviously ETi = 0, VarT^ < E(</?j>(X)^j(X)l^(X))2 
< /imaxM22-2j'^ and \Ti\ < M2-^'2^2+\ So that we can apply Bernstein inequality to get 



P(// > 6/3) < 2 exp 



n 



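And finally, turn to $I$. Assume first that the noise $\xi$ is bounded by $K$. We have obviously
$\mathbb{E}\varphi_{j,k}(X_i)\xi_i \mathbb{1}_{\mathcal{H}}(X_i) = 0$, $\mathrm{Var}(\varphi_{j,k}(X_i)\xi_i \mathbb{1}_{\mathcal{H}}(X_i)) \le K^2 \mu_{\max}$ and $|\varphi_{j,k}(X_i)\xi_i \mathbb{1}_{\mathcal{H}}(X_i)| \le K 2^{jd/2+1}$, so
that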



P(/ > 6/3) < 2 exp 



n6^ 



18K^fi^,, + 4K2^l6 



Now, it is enough to notice that for all s > and j such that j > 7 log2 ^ (which becomes a 
constraint for < M only). 



n2'^^'6^ n6^ 
> 



18/x„axM2 + 4M2J 22i^5 ISK^fi^^^ + AKT26 

which concludes the proof under (Nl). When j > Mog2 3M, the conclusion under (N2) is a 
direct consequence of Proposition 12.6. □ 

Proposition 12.6. Let ipj^^ be a scaling function andT-L a subset of £ . Define 

1 " 

I = - Y.^UX^)^^t'H{X,). 



n 
1=1 



Assume now that the noise ^ is conditionally Gaussian, that is we are under (N2). Then, we no- 
tice that, conditionally on Xi, . . . , X„, I ~ $(0, crpj ^/ y/n), where p'^j^ = n^^ J2^=i fj,k{Xi)'^^H{^i 
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Then, for all 5 > 0, one can write 



,■ d „x 1 



> 5) < 1 A <^ — — exp ' 



+ 2 exp 

Proof. For any 5 > 0, we write 



6V27m V /irnax + 2-'2(5 

n6^ 



Notice first that 



So that 



The first term is handled thanks to a regular Gaussian tail inequality. Notice indeed that 

1 2pj I n6^ \ 1 

• d ^, 1 



< 1 A < — — — exp ' 



5V2^ V /imax + 2^2(5 

In addition, notice that E(fj^k{X)Hn{X) < /imax2^'^ and ly^^- fc(X)2l^(Xi) -E(/?j-fe(X)2l^(X)| < 
2jd+i^ so that a direct application of Bernstein inequality leads to 

P(q,(2^S5)'=) < 2 exp I -— = 2exp ' 



2^-^(2/i^ax + |2^'^5)y V 2/i„,,. + |2^25^ 

which concludes the proof. □ 

Proposition 12.7. Let m he a constant such that m > and fix z E W'- such that z G 
Boo{2~^ ,m) . Write & := {k E Z'^ : 2^^ G Suppipk], the set of indexes corresponding to the 
scaling functions whose support Suppipk contains the point 2^^ G W^. The scaling functions 
{(pk) verify the local linear independence property in the sense that Ylkee ^kVk = on 
the domain Boo{z, m) if and only if = for all k E &. 

Proof. This result is derived from [33] and its proof can be found in [39]. □ 
12.2 Proof of the upper-bound results under (CS2)

Recall that under (CS2), we work with a sample of size $2n$ split into two sub-samples denoted
by $\mathcal{D}_n$ and $\mathcal{D}'_n$. As detailed previously, results similar to the ones described in Section 7,
Section 8 and Section 12.1.4 are still valid with $\eta^{\maltese}$ under (CS2). They in fact all stem from
Theorem 9.1. The proofs remain for the most part unchanged, with $\mathcal{J}_n$ redefined as $\mathcal{J}_n =
\{j_s, j_s + 1, \dots, J - 1, J\}$ where $2^{j_s} = \lfloor n^{1/(2s+d)} \rfloor$, $\eta^{\maltese}$ in place of $\eta^{@}$, $\tilde X_i$ in place of $X_i$ (where we
have written $\tilde u = u - X'_{i_x} + 2^{-j-1}$), and $\mathcal{H}_0$ in place of $\mathcal{H}$. The sole differences appear in the
proofs of Theorem 9.1 and Proposition 12.4. Let us start with the proof of Theorem 9.1.

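Proof of Theorem 9.1. Assume we are under (CS2) and want to control the probability of
deviation of $\eta^{\maltese}_j(x)$ from $\eta(x)$ at a point $x \in A$, for some $j \in \mathcal{J}_n$. Recall that $\mathcal{H}_0(x)$ stands
for the cell $\mathcal{H}_0 = 2^{-j}[0, 1]^d$ centered at $x$ at level $j$, that is $\mathcal{H}_0(x) = x - 2^{-j-1} + 2^{-j}[0, 1]^d$, and
denote by $\mathcal{O}_x$ the event

$$\mathcal{O}_x = \{\#\{i : X'_i \in \mathcal{H}_0(x)\} \ge 1\}.$$

We can write

$$\mathbb{P}(|\eta(x) - \eta^{\maltese}_j(x)| > \delta) = \mathbb{P}(|\eta(x) - \eta^{\maltese}_j(x)| > \delta,\ \mathcal{O}_x) + \mathbb{P}(|\eta(x) - \eta^{\maltese}_j(x)| > \delta,\ \mathcal{O}_x^c).$$

Focus first on what happens on the event $\mathcal{O}_x^c$. The last term can be controlled easily since the
probability that no single design point of $\mathcal{D}'_n$ belongs to $\mathcal{H}_0(x)$ decreases exponentially fast
with $n$. Notice indeed that, under (CS2),

$$\begin{aligned}
\mathbb{P}(\mathcal{O}_x^c) &= \mathbb{P}\Big(\bigcap_{i=1}^{n} \{X'_i \notin \mathcal{H}_0(x)\}\Big) \\
&= \big(1 - \mathbb{P}(X'_1 \in \mathcal{H}_0(x))\big)^n \\
&= \Big(1 - \int_{\mathcal{H}_0(x) \cap A} \mu(w)\,dw\Big)^n \\
&\le \big(1 - \mu_{\min} 2^{-jd}\, \lambda\big(2^{j}(A - x) \cap [-2^{-1}, 2^{-1}]^d\big)\big)^n \\
&\le \big(1 - \mu_{\min} 2^{-jd} \min(2m_0, 2^{-1})^d\big)^n \\
&\le \exp\big(-\mu_{\min} \min(2m_0, 2^{-1})^d\, n 2^{-jd}\big),
\end{aligned}$$

where the penultimate inequality is a direct consequence of (S2) and the last one comes from the
fact that for any $u \in [0, 1)$, $\ln(1 - u) \le -u$. Now, recall that $\eta^{\maltese}_j(x) = 0$ on $\mathcal{O}_x^c$ and $|\eta(x)| \le M$
since $\eta \in \mathscr{L}^s(\mathbb{R}^d, M)$. So that we obtain

$$\mathbb{P}(|\eta(x) - \eta^{\maltese}_j(x)| > \delta,\ \mathcal{O}_x^c) \le \exp\big(-\mu_{\min} \min(2m_0, 2^{-1})^d\, n 2^{-jd}\big)\, \mathbb{1}_{\{\delta < M\}},$$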

which is smaller than the first term in the upper-bound of Theorem 9.1. Now focus on what 
happens on the event Ox- We can write 

P(|r^(x) - rif{x)\ > 6, O,) = P(a)E[P(|r^(x) - vf{x)\ > 6\X'J\Ox] 

<E[FMx)-vf{x)\>5\X'J\Ox]- 
Therefore, it is enough to control the probability of deviation of $\eta^{\maltese}(x)$ from $\eta(x)$ on $\mathcal{O}_x$,
conditionally on $X'^{n}_{1}$. It is controlled in exactly the same way as the probability of deviation
of $\eta^{@}(x)$ from $\eta(x)$ under (CS1), except that we now work with conditional probabilities and
expectations with respect to $X'^{n}_{1}$. Interestingly, the random variable is independent of the
points of $\mathcal{D}_n$ since it is built upon the design points $(X'_i)$ of $\mathcal{D}'_n$, which are themselves independent
of the points of $\mathcal{D}_n$. This is a key feature that makes theoretical computations tractable under
(CS2) and allows us to handle $\eta^{\maltese}$ in a similar way as $\eta^{@}$ under (CS1). As announced above,
Proposition 12.4 is the sole result that is not obviously true under (CS2). However, it can be
extended to setting (CS2) without much trouble (see below). Ultimately, this proves that,
on the event $\mathcal{O}_x$ and conditionally on $X'^{n}_{1}$, the probability of deviation of $\eta^{\maltese}(x)$ from $\eta(x)$
verifies Theorem 7.1. Finally, it remains to put everything together to obtain the results
announced in Theorem 9.1, which concludes the proof. □
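
As a numerical sanity check of the exponential bound obtained at the start of the proof, the following minimal Python sketch compares the probability that $n$ independent design points all miss a cell of mass $p$, namely $(1-p)^n$, with its upper bound $\exp(-np)$. The function name and all numerical values are hypothetical and chosen for illustration only; the parameter `c` stands for the constant $\min(2m_0, 2^{-1})$.

```python
import math


def empty_cell_probability(n, j, d, mu_min, c):
    """Return ((1 - p)^n, exp(-n p)) with p = mu_min * 2^(-j d) * c^d.

    p plays the role of the lower bound on the mass of the cell;
    all arguments are hypothetical illustration values.
    """
    p = mu_min * 2.0 ** (-j * d) * c ** d
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    exact = (1.0 - p) ** n       # probability that n i.i.d. points all miss the cell
    bound = math.exp(-n * p)     # bound derived from ln(1 - p) <= -p
    return exact, bound


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        exact, bound = empty_cell_probability(n, j=3, d=2, mu_min=0.5, c=0.5)
        print(n, exact, bound)
```

Both quantities decay geometrically in $n 2^{-jd}$, which is why this term is negligible as long as the resolution level $j$ is not too large compared with $\log n$.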

As detailed in [39], the proof of Proposition 12.4 can be extended to setting (CS2), thanks to 
the local linear independence property of the scaling functions (see Proposition 12.7) and 
a compactness argument. In particular, we obtain the following result, which is proved in [39]. 

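Lemma 12.1. Let $r \in \mathbb{N}$. Let $\varphi$ be the Daubechies' scaling function of regularity $r$ and
$\mathcal{S} = \{\nu \in \mathbb{Z}^d : 2^{-1} \in \operatorname{Supp} \varphi_\nu\}$. Then, there exists a strictly positive absolute constant $c_{\min}$ such
that

$$\inf_{u \in \mathbb{S}^{R^d - 1}} \ \inf_{m \geq m_0} \ \inf_{z \in B_\infty(2^{-1}, m)} \int_{B_\infty(z,m) \cap [0,1]^d} \Big( \sum_{\nu \in \mathcal{S}} u_\nu \varphi_\nu(w) \Big)^2 dw \geq c_{\min}, \tag{26}$$

$$\inf_{m \geq m_0} \ \inf_{z \in B_\infty(2^{-1}, m)} \int_{B_\infty(z,m) \cap [0,1]^d} \sum_{\nu \in \mathcal{S}} \varphi_\nu(w)^2 \, dw \geq c_{\min}. \tag{27}$$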



Appendix 

Generalized Lipschitz spaces 

Here, we sum up relevant facts about Lipschitz and Besov spaces on $\mathbb{R}^d$ as stated in [7, Chap. 3]
for any $d \in \mathbb{N}$ and [11, Chap. 2, §9] for $d = 1$. Let us denote by $\mathscr{C}(\mathbb{R}^d)$ and $\mathscr{A}(\mathbb{R}^d)$ the spaces
of continuous and absolutely continuous functions on $\mathbb{R}^d$, respectively. Let us denote by $\|\cdot\|$
the Euclidean norm of $\mathbb{R}^d$, $f$ a function defined on $\mathbb{R}^d$ and write $\Delta_h^1(f,x) = |f(x + h) - f(x)|$
for any $x \in \mathbb{R}^d$. For any $r \in \mathbb{N}$ and all $x \in \mathbb{R}^d$, we further define the $r$-th finite difference by
induction as follows,

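$$\Delta_h^r(f,x) = \Delta_h^1\big(\Delta_h^{r-1}(f,\cdot), x\big),$$

and the $r$-th modulus of smoothness of $f \in \mathscr{C}(\mathbb{R}^d)$ as follows

$$\omega_r(f,t)_\infty = \sup_{0 < \|h\| \leq t} \|\Delta_h^r(f,\cdot)\|_{L_\infty(\mathbb{R}^d, \lambda)}.$$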
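
To make the modulus of smoothness concrete, here is a minimal Python sketch that approximates it on a grid, in dimension $d = 1$ and on a bounded interval. It is an illustration under stated assumptions rather than part of the paper: the function names are hypothetical, the supremum is replaced by a maximum over finitely many points, and the $r$-th difference is computed with the standard signed formula $\sum_{k=0}^{r} (-1)^{r-k} \binom{r}{k} f(x + kh)$, with the absolute value taken at the end.

```python
from math import comb

import numpy as np


def finite_difference(f, x, h, r):
    """Signed r-th forward difference: sum_k (-1)^(r-k) C(r, k) f(x + k h)."""
    return sum((-1) ** (r - k) * comb(r, k) * f(x + k * h) for k in range(r + 1))


def modulus_of_smoothness(f, t, r, a=0.0, b=1.0, n_x=2001, n_h=50):
    """Grid approximation of sup_{0 < h <= t} sup_x |Delta_h^r f(x)| on [a, b].

    Only positive steps h are scanned: in dimension one, a negative step
    gives the same absolute difference at a shifted point.
    """
    best = 0.0
    for h in np.linspace(t / n_h, t, n_h):
        if r * h >= b - a:                    # x + r h must stay inside [a, b]
            break
        x = np.linspace(a, b - r * h, n_x)
        best = max(best, float(np.max(np.abs(finite_difference(f, x, h, r)))))
    return best


if __name__ == "__main__":
    square = lambda x: x ** 2
    for t in (0.2, 0.1, 0.05):
        # The second difference of x^2 equals 2 h^2 for every x.
        print(t, modulus_of_smoothness(square, t, r=2))
```

For a polynomial of degree strictly less than $r$ the $r$-th difference vanishes identically, which is why $r$ must exceed the targeted smoothness $s$ in the definitions that follow.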
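Write $s > 0$ and $r = \lfloor s \rfloor + 1$. The Besov space $B^s_{\infty,\infty}$ on $\mathbb{R}^d$, also known as the generalized
Lipschitz space $\mathscr{L}^s(\mathbb{R}^d)$, is the collection of all functions $f \in \mathscr{C}(\mathbb{R}^d) \cap L_\infty(\mathbb{R}^d, \lambda)$ such that the
semi-norm

$$|f|_{\mathscr{L}^s(\mathbb{R}^d)} := \sup_{t > 0} \big(t^{-s} \omega_r(f,t)_\infty\big),$$

is finite. The norm for $\mathscr{L}^s(\mathbb{R}^d)$ is subsequently defined as

$$\|f\|_{\mathscr{L}^s(\mathbb{R}^d)} := \|f\|_{L_\infty(\mathbb{R}^d, \lambda)} + |f|_{\mathscr{L}^s(\mathbb{R}^d)}.$$

Fix a real number $M > 0$. Throughout the paper, $\mathscr{L}^s(\mathbb{R}^d, M)$ refers to the ball of $\mathscr{L}^s(\mathbb{R}^d)$ of
radius $M$. Obviously, the elements of $\mathscr{L}^s(\mathbb{R}^d, M)$ are $\lambda$-a.e. uniformly bounded by $M$ on $\mathbb{R}^d$.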
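As described in [11, 7], there exists an alternative definition of Lipschitz spaces $\mathscr{C}^s(\mathbb{R}^d)$, also
known as Hölder spaces, which goes as follows. For any integer $d$, multi-index $q = (q_1, \dots, q_d) \in \mathbb{N}^d$
and $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, we define the differential operator $\partial^q$ as usual by $\partial^q := \frac{\partial^{|q|_1}}{\partial x_1^{q_1} \cdots \partial x_d^{q_d}}$.
For any positive integer $s$, $\mathscr{C}^s(\mathbb{R}^d)$ consists of the functions $f$ on $\mathbb{R}^d$ such that $\partial^q f$ is bounded
and absolutely continuous on $\mathbb{R}^d$, for all $q \in \mathbb{N}^d$ such that $|q|_1 := q_1 + \dots + q_d < s$. This
definition is extended to non-integer $s$ as follows,

$$\mathscr{C}^s(\mathbb{R}^d) := \Big\{ f \in \mathscr{C}(\mathbb{R}^d) \cap L_\infty(\mathbb{R}^d, \lambda) : \sup_{x \in \mathbb{R}^d} \Delta_h^1(f,x) \leq C|h|^s \Big\}, \quad 0 < s < 1,$$

$$\mathscr{C}^s(\mathbb{R}^d) := \big\{ f \in \mathscr{C}(\mathbb{R}^d) \cap L_\infty(\mathbb{R}^d, \lambda) : \partial^q f \in \mathscr{C}^{s-m}(\mathbb{R}^d),\ |q|_1 = m \big\}, \quad m < s < m + 1,\ m \in \mathbb{N}.$$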

It can be shown that, for all non-integer $s > 0$, $\mathscr{C}^s(\mathbb{R}^d) = \mathscr{L}^s(\mathbb{R}^d)$, while $\mathscr{C}^s(\mathbb{R}^d)$ is a strict
subset of $\mathscr{L}^s(\mathbb{R}^d)$ when $s \in \mathbb{N}$ (see [11, p. 52] for examples of functions that belong to $\mathscr{L}^1([0, 1])$
but not to $\mathscr{C}^1([0, 1])$ in the particular case where $d = 1$).
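
The gap between the two scales at integer smoothness can be observed numerically. The sketch below uses $f(x) = x \log x$ on $[0, 1]$, a classical example that is not necessarily the one discussed in [11]. It assumes that membership in the integer-order space for $s = 1$ requires a first-order Lipschitz bound $|f(x + h) - f(x)| \leq C|h|$, whereas the generalized Lipschitz space only requires the second-order bound $\omega_2(f,t)_\infty \leq Ct$. Function names are hypothetical and the second difference is the standard signed one.

```python
import numpy as np


def f(x):
    """x * log(x), extended by continuity with f(0) = 0."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    pos = x > 0
    out[pos] = x[pos] * np.log(x[pos])
    return out


def sup_first_difference(h, n_x=4001):
    """Grid maximum of |f(x + h) - f(x)| over x in [0, 1 - h]."""
    x = np.linspace(0.0, 1.0 - h, n_x)
    return float(np.max(np.abs(f(x + h) - f(x))))


def sup_second_difference(h, n_x=4001):
    """Grid maximum of |f(x + 2h) - 2 f(x + h) + f(x)| over x in [0, 1 - 2h]."""
    x = np.linspace(0.0, 1.0 - 2.0 * h, n_x)
    return float(np.max(np.abs(f(x + 2.0 * h) - 2.0 * f(x + h) + f(x))))


if __name__ == "__main__":
    for h in (1e-1, 1e-2, 1e-3, 1e-4):
        # First ratio grows like |log h|; second ratio stays bounded.
        print(h, sup_first_difference(h) / h, sup_second_difference(h) / h)
```

The first-order ratio is unbounded as $h \to 0$, so no Lipschitz constant exists, while the second-order ratio remains constant at $2 \log 2$, attained at $x = 0$.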

Furthermore, we define these function spaces on the subset $\mathcal{E}$ of $\mathbb{R}^d$ as the restriction of their
elements to $\mathcal{E}$. As explained in [7, Remark 3.2.4], function spaces on $\mathcal{E}$ can be defined by
restriction or, alternatively, in an intrinsic way, and both definitions coincide for fairly general
domains $\mathcal{E}$ of $\mathbb{R}^d$.

Looking at function spaces on $\mathcal{E}$ as function spaces on $\mathbb{R}^d$ restricted to $\mathcal{E}$ justifies the use of
MRAs of $L_2(\mathbb{R}^d, \lambda)$ in our local analysis.

MRAs and smoothness analysis 

Multivariate MRAs will always be assumed to be obtained from a tensorial product of
one-dimensional MRAs, as described in [7, §1.4, eq. (1.4.10)]. We will denote by
$\varphi_{j,k}(\cdot) = 2^{jd/2} \varphi(2^j \cdot - k)$ the translated and dilated version of $\varphi$ with $k \in \mathbb{Z}^d$. As usual, we write $V_j$ to mean
$\operatorname{Closure}(\operatorname{Span}\{\varphi_{j,k},\ k \in \mathbb{Z}^d\})$, so that $\operatorname{Closure}(\cup_{j \geq 0} V_j) = L_2(\mathbb{R}^d, \lambda)$ (where the closures are
taken with respect to the $L_2(\mathbb{R}^d, \lambda)$-metric).

The r-MRAs defined in Section 5.2 are intimately connected with generalized Lipschitz spaces. 
Assume we are given an $r$-MRA with $r \in \mathbb{N}$ and $\eta \in \mathscr{L}^s(\mathbb{R}^d, M)$, where $s \in (0, r)$ and $M > 0$.
Denote by $\mathscr{P}_j \eta$ the orthogonal projection of $\eta$ onto $V_j$ and by $\mathscr{R}_j \eta = \eta - \mathscr{P}_j \eta$ the corresponding
remainder. Then, we have for all $x \in \mathbb{R}^d$, $\eta(x) = \mathscr{P}_j \eta(x) + \mathscr{R}_j \eta(x)$ where $\|\mathscr{R}_j \eta\|_{L_\infty(\mathbb{R}^d, \lambda)} \leq M 2^{-js}$,
as detailed in [7, Corollary 3.3.1]. It is noteworthy that the above approximation results
remain valid in the particular case where we work on the subset $\mathcal{E}$ of $\mathbb{R}^d$ and consider $\eta$ to be the
restriction to $\mathcal{E}$ of an element of $\mathscr{L}^s(\mathbb{R}^d)$.
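
The decay of the remainder in $2^{-js}$ can be illustrated numerically in the simplest case. The sketch below is only an illustration under assumptions that are not taken from the paper: it works in dimension $d = 1$ on $[0, 1)$, uses the Haar system (piecewise-constant projections, assumed here to play the role of the lowest-order MRA), replaces integrals by averages over a fine grid, and takes the hypothetical test function $\eta(x) = |x - 1/3|^{s}$ with $s = 1/2$.

```python
import numpy as np


def haar_projection(values, j):
    """Piecewise-constant projection at level j.

    `values` are samples on a uniform grid of [0, 1) whose size is a multiple
    of 2^j; each sample is replaced by the average over its dyadic cell
    [k 2^-j, (k + 1) 2^-j).
    """
    cells = values.reshape(2 ** j, -1)            # one row per dyadic cell
    return np.repeat(cells.mean(axis=1), cells.shape[1])


def rescaled_haar_errors(s=0.5, levels=range(1, 9), n_grid=2 ** 14):
    """Return (j, sup-norm error, error * 2^(j s)) for each level j."""
    x = (np.arange(n_grid) + 0.5) / n_grid        # cell midpoints of the fine grid
    eta = np.abs(x - 1.0 / 3.0) ** s              # Hoelder continuous with exponent s
    results = []
    for j in levels:
        err = float(np.max(np.abs(eta - haar_projection(eta, j))))
        results.append((j, err, err * 2.0 ** (j * s)))
    return results


if __name__ == "__main__":
    for j, err, scaled in rescaled_haar_errors():
        print(j, err, scaled)
```

The rescaled error in the last column stays bounded as $j$ grows, which is the behaviour predicted by the bound on the remainder.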